MEGA RST Discourse Treebanks with Structure and Nuclearity from Scalable Distant Sentiment Supervision

The lack of large and diverse discourse treebanks hinders the application of data-driven approaches, such as deep-learning, to RST-style discourse parsing. In this work, we present a novel scalable methodology to automatically generate discourse treebanks using distant supervision from sentiment-annotated datasets, creating and publishing MEGA-DT, a new large-scale discourse-annotated corpus. Our approach generates discourse trees incorporating structure and nuclearity for documents of arbitrary length by relying on an efficient heuristic beam-search strategy, extended with a stochastic component. Experiments on multiple datasets indicate that a discourse parser trained on our MEGA-DT treebank delivers promising inter-domain performance gains when compared to parsers trained on human-annotated discourse corpora.


Introduction
Discourse parsing is an important Natural Language Processing (NLP) task, aiming to uncover the hidden structure underlying coherent documents, as described by theories of discourse like Rhetorical Structure Theory (RST) (Mann and Thompson, 1988) or PDTB (Prasad et al., 2008). Not only has discourse parsing been shown to enhance key downstream tasks, such as text classification (Ji and Smith, 2017), summarization (Gerani et al., 2014) and sentiment analysis (Bhatia et al., 2015; Nejat et al., 2017; Hogenboom et al., 2015), but it also appears to complement contextual embeddings, like BERT (Devlin et al., 2018), in tasks where discourse information is critical, such as argumentation analysis (Chakrabarty et al., 2019).
Traditionally, RST-style discourse parsing builds a complete, hierarchical constituency tree for a document (Morey et al., 2018), where leaf nodes are clause-like sentence fragments, called elementary-discourse-units (EDUs), while internal tree nodes are labelled with discourse relations (e.g., Evidence, Contrast). In addition, each node is given a nuclearity attribute, which encodes the importance of the node in its local context.
A key limitation for further research in RST-style discourse parsing is the scarcity of training data. Only a few human annotated discourse treebanks exist, each only containing a few hundred documents. Although our recent efforts using distant supervision from sentiment to generate large-scale discourse treebanks have already partly addressed this dire situation (Huber and Carenini, 2019), the previously proposed solution is still limited in: (i) Scope, by only building the RST constituency structure without nuclearity and relation labels; and (ii) Applicability, by relying on a non-scalable CKY solution, which cannot be applied to many real-world datasets with especially long documents.
In this work, we propose a significant extension to this line of research by introducing a scalable solution for documents of arbitrary length and further moving beyond just predicting the tree-structure by incorporating the nuclearity attribute, oftentimes critical in informing downstream tasks (Marcu, 2000; Ji and Smith, 2017; Shiv and Quirk, 2019). Inspired by the recent success of heuristic search in NLP tasks involving trees (e.g., Fried et al. (2017); Mabona et al. (2019)), we develop a beam-search strategy implementing an exploration-exploitation trade-off, as commonly used in reinforcement-learning (RL) (Poole and Mackworth, 2010).
Remarkably, by following this heuristic approach, any large corpus annotated with sentiment can be turned into a discourse treebank on which a domain/genre specific discourse parser can be trained. As a case study for this process, we annotate, evaluate and publicly release a new discourse-augmented Yelp '13 corpus (Tang et al., 2015) called MEGA-DT¹ (comprising ≈250,000 documents) with nuclearity-attributed "silver-standard" discourse trees, solely leveraging the corpus' document-level sentiment annotation.
To evaluate the quality of our newly proposed MEGA-DT corpus, we conduct a series of experiments. We train the top-performing discourse parser by Wang et al. (2017) on MEGA-DT and compare its performance with the same parser trained on previously proposed treebanks. Specifically, we compare our discourse-annotated dataset against a smaller "silver-standard" treebank (Huber and Carenini, 2019) containing ≈100,000 documents with ≤20 EDUs and two standard human-annotated corpora in the news domain (RST-DT) (Carlson et al., 2002) and in the instructional domain (Subba and Di Eugenio, 2009).
Results indicate that while training a parser on MEGA-DT does not yet match the performance of training and testing on the same treebank (intradomain), it does push the boundaries of what is possible with distant supervision. In most cases, training on MEGA-DT delivers statistically significant improvements on the arguably more difficult and useful task of inter-domain discourse prediction, where a parser is trained on one domain and tested/applied to another one.
Overall, this suggests that our new approach to distant supervision from sentiment can generate large-scale, high-quality treebanks, with MEGA-DT being the best publicly available resource for training a discourse parser in domains where no gold-standard discourse annotation is available.

Related Work
The most closely related line of work is RST-style discourse parsing, with the goal to obtain a complete discourse tree, including structure, nuclearity and relations. Based on the observation that these three aspects are correlated, most previous work has explored models to learn them jointly (e.g., Joty et al. (2015); Ji and Eisenstein (2014); Yu et al. (2018)). However, while this strategy seems intuitive, the state-of-the-art (SOTA) system on structure-prediction by Wang et al. (2017) applies a rather different strategy, first jointly predicting structure and nuclearity and then subsequently predicting relations. The main motivation behind this separation is the large number of possible output classes when predicting these three aspects together. The success of the system by Wang et al. (2017) on the widely used RST-DT corpus inspires us to also learn structure and nuclearity jointly, rather than combining all three aspects.
¹ Our new Discourse Treebank and the code to generate further "silver-standard" discourse treebanks can be found at: https://www.cs.ubc.ca/cs-research/lci/research-groups/natural-language-processing/
The second line of related work infers fine-grained information from coarse-grained supervision signals using machine learning. Due to the lack of annotated data in many domains and for many real-world tasks, methods to automatically generate reliable, fine-grained data-labels have been explored for many years. One promising approach in this area is Multiple Instance Learning (MIL) (Keeler et al., 1991). The general task of MIL is to retrieve fine-grained information (called instance-labels) from high-level supervision (called bag-labels), using correlations of discriminative features within and between bags to predict labels for instances. With the recent rise of deep-learning, neural MIL approaches have also been proposed (Angelidis and Lapata, 2018).
We previously combined the two lines of related work described above to create discourse structures from large-scale datasets, solely using document-level supervision (Huber and Carenini, 2019). When applied to the auxiliary task of sentiment analysis, we generated a "silver-standard" discourse structure treebank using the neural MIL model by Angelidis and Lapata (2018) in combination with a sentiment-guided CKY-style tree-construction algorithm, generating near-optimal discourse trees in bottom-up fashion (Jurafsky and Martin, 2014). Although our approach has shown clear benefits, it is inapplicable to many real-world datasets, as it does not scale to long documents and cannot predict nuclearity- and relation-labels. Addressing these limitations is a major motivation of this paper.
Further efforts to automatically generate discourse trees from auxiliary tasks have mostly focused on latent tree induction, generating trees from text classification (Karimi and Tang, 2019) or summarization tasks (Liu et al., 2019). In both approaches, domain-dependent discourse trees are induced during the neural training process. While both methods have been shown to improve the performance on the downstream task itself, subsequent research by Ferracane et al. (2019) indicates that the induced trees are often trivial and shallow, and do not represent valid discourse structures.
The third stream of related work is on leveraging heuristic search algorithms in NLP tasks involving trees. For syntactic parsing, Vinyals et al. (2015) and Fried et al. (2017) show that a static, small beam size (e.g. 10) already achieves good performance, with Dyer et al. (2016) delivering promising results by using greedy decoding. As a recent example for discourse parsing, Mabona et al. (2019) successfully combine standard beam-search with shift-reduce parsing using two parallel beams for shift and reduce actions. Overall, recent work shows that beam-search approaches and their possible extensions can effectively address scalability issues in multiple parsing scenarios. In this paper, we extend the standard beam-search approach with a stochastic exploration-exploitation trade-off, as used in Reinforcement Learning, where signals also tend to be sparse and noisy.

Predicting Discourse from Sentiment
Previous work has shown that incorporating RST-style discourse trees can help to predict document-level sentiment (Bhatia et al., 2015; Nejat et al., 2017; Hogenboom et al., 2015). These findings give rise to the assumption that the sentiment of a document can also provide important information on its discourse structure. In the following sub-section, we briefly revisit our previous approach to exploit this assumption by solely relying on document-level sentiment annotations (Huber and Carenini, 2019). Afterwards, we present our new approach to overcome scalability issues and jointly predict structure and nuclearity.

Predicting Discourse Structure for Short Documents
The discourse-structure tree of a document can be predicted from its global sentiment by combining Multiple Instance Learning (MIL) and the CKY algorithm. We will illustrate the process on the following negative sentiment example with a polarity of −0. To obtain the tuple $\{p_{EDU}, a_{EDU}\}$ for each EDU in a document, the neural MIL model (Angelidis and Lapata, 2018) is trained on a document-level sentiment dataset, with the goal to predict sentiment-labels for EDU-level instances. The model therefore generates a mapping from inputs (EDUs) to the respective outputs (sentiment-classes) by exploiting correlations between the appearance of EDUs in documents and the respective document gold-labels across a corpus. For example, the EDU [had a good taste,]$_3$ will most likely appear predominantly in positive documents, allowing the MIL model to infer a positive EDU-level sentiment polarity $p_{EDU}$ for this input. When applying the neural model by Angelidis and Lapata (2018), an attention mechanism is internally used to weight the importance of EDUs for the overall document sentiment. An attention-weight $a_{EDU}$ is also extracted for each EDU and subsequently used as an importance score when aggregating subtrees using the CKY approach.
From those tuples $\{p_{EDU}, a_{EDU}\}$ assigned to leaf-nodes, the sentiment polarity $p$ and attention score $a$ for any internal node in an arbitrary constituency tree can be computed bottom-up by aggregating its two child nodes $c_l, c_r$. Out of the set of potential aggregation functions proposed in Huber and Carenini (2019), the best performing approach has been shown to be:

$$p(c_l, c_r) = \frac{p_{c_l} \cdot a_{c_l} + p_{c_r} \cdot a_{c_r}}{a_{c_l} + a_{c_r}}, \qquad a(c_l, c_r) = \frac{a_{c_l} + a_{c_r}}{2} \qquad (1)$$

By recursively applying this function from the leaf-nodes, we can compute the sentiment and attention of the root node, representing the full document. The process of selecting the best discourse tree for a given document can be framed as finding the tree for which the sentiment of the root node (spanning the whole document) is the closest to the gold-standard sentiment annotation. A brute-force solution to this problem is to generate all possible discourse trees using the general CKY algorithm and to select the best tree amongst all candidates. However, the computational complexity of this approach quickly explodes, as shown for our running example with 5 EDU leaf-nodes in Figure 1. From the decision-space of possible tree-structures, the tree with the shortest sentiment-distance from the gold-standard, computed at the root node, is selected.
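The brute-force selection described above can be sketched in a few lines. This is a minimal illustration and not the released implementation: it assumes the aggregation rule is an attention-weighted average of the child polarities together with an averaged attention, it assumes strictly positive attention values, and the names `Node`, `merge`, `all_trees`, `best_tree` as well as the five leaf values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    polarity: float               # sentiment polarity p of the subtree
    attention: float              # attention score a of the subtree (assumed > 0)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def merge(left: Node, right: Node) -> Node:
    """Aggregate two children: attention-weighted polarity, averaged attention."""
    total = left.attention + right.attention
    polarity = (left.polarity * left.attention + right.polarity * right.attention) / total
    return Node(polarity, total / 2, left, right)


def all_trees(leaves: List[Node]) -> List[Node]:
    """Enumerate every binary tree over the leaves, keeping their order."""
    if len(leaves) == 1:
        return [leaves[0]]
    trees = []
    for split in range(1, len(leaves)):
        for l in all_trees(leaves[:split]):
            for r in all_trees(leaves[split:]):
                trees.append(merge(l, r))
    return trees


def best_tree(leaves: List[Node], gold: float) -> Node:
    """Pick the tree whose root polarity is closest to the gold document sentiment."""
    return min(all_trees(leaves), key=lambda t: abs(t.polarity - gold))


# Hypothetical (polarity, attention) values for a 5-EDU document.
leaves = [Node(-0.6, 0.9), Node(-0.2, 0.4), Node(0.7, 0.3), Node(-0.8, 0.8), Node(0.1, 0.2)]
candidates = all_trees(leaves)
best = best_tree(leaves, gold=-0.5)
print(len(candidates), round(best.polarity, 3))
```

The recursion over split points makes the cost of exhaustive enumeration visible: every additional EDU multiplies the number of candidate trees that have to be built and scored.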
Although this method has been shown to provide reasonably good trees when leveraged for discourse-structure parsing, it is limited in two fundamental ways: (1) The approach is not scalable. Since its space complexity grows with the Catalan number $C_n = \frac{1}{n+1}\binom{2n}{n}$ for trees with $n + 1$ EDUs, it can only be applied to short documents (up to ≈20 EDUs, see bottom row in Table 1), making it impractical for many real-world datasets containing longer documents, such as the Yelp '13 (Tang et al., 2015), IMDB (Diao et al., 2014) or Amazon Review dataset (Zhang et al., 2015). (2) Due to the high computational complexity of the structure prediction itself, the inference of further RST-tree properties, such as nuclearity and relations, often critical for downstream tasks, is not feasible with this unconstrained CKY approach.
Figure 1: All 14 projective discourse trees annotated with sentiment for a 5 EDU document (using a simplified color-scheme, green = positive, red = negative, grey = neutral, omitting the attention attribute).
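To make the growth rate concrete, the number of binary trees over a given number of EDUs can be computed directly from the Catalan formula. The sketch below uses only the Python standard library; the helper names `catalan` and `num_trees` are illustrative.

```python
from math import comb


def catalan(n: int) -> int:
    """n-th Catalan number, C_n = binom(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


def num_trees(num_edus: int) -> int:
    """Number of binary trees over num_edus ordered leaves (C_{num_edus - 1})."""
    return catalan(num_edus - 1)


for k in (5, 10, 20, 50):
    print(f"{k:>3} EDUs -> {num_trees(k)} candidate trees")
```

For 5 EDUs this yields the 14 trees of the running example, while 20 EDUs already correspond to more than a billion candidate structures.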

Predicting Discourse Structure and Nuclearity from Arbitrary Documents
Inspired by the recent success in applying beam-search to enhance the scalability of multiple NLP parsing tasks (Mabona et al., 2019; Fried et al., 2017; Dyer et al., 2016; Vinyals et al., 2015), we propose a novel heuristic beam-search approach that can automatically generate discourse trees containing structure- and nuclearity-attributes for documents of arbitrary length.
Stochastic Beam-Search. In essence, the general CKY dynamic programming algorithm creates all possible binary trees covering the $n$ EDUs by internally filling an $(n \times n)$ matrix, where each cell$(i, j)$ contains the information on all subtrees covering the text spans from $EDU_i$ to $EDU_j$. Our heuristic beam-search solution limits the computational complexity of this process by reducing the number of subtrees stored in each cell to a constant beam size $B$. This naturally raises the question of how to select the $B$ subtrees to preserve in each cell. We follow the intuitive assumption that subtrees for which the sentiment diverges most from the overall document sentiment (the only available supervision for this task) can be safely discarded. Out of the set of possible subtrees $T$ for a given cell, only the subset $T' \subseteq T$ with $|T'| = B$ is kept, containing the $B$ subtrees with the closest sentiment polarity $p_{t_i}$ to the gold-label ($gl$) sentiment of the document. Formally:

$$T' = \underset{T' \subseteq T,\; |T'| = B}{\arg\min} \sum_{t_i \in T'} |p_{t_i} - gl| \qquad (2)$$

However, one limitation of this heuristic rule is that it strictly prefers subtrees with sentiment closer to the overall document sentiment, independent of their distance from the root node. This can be problematic when applied in early stages of the tree-generation process, where only a few EDUs are combined. For instance, a mostly positive document might still contain certain negative subtrees at its lowest levels, which also need to be aggregated appropriately.
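The deterministic pruning rule can be illustrated with a small chart-based sketch. It is only a sketch under stated assumptions: the aggregation rule (attention-weighted polarity, averaged attention), the tuple layout, the function name `beam_cky` and the leaf values are all hypothetical and not taken from the released code.

```python
from typing import Dict, List, Tuple

# A subtree is (polarity, attention, bracketing); the bracketing is a leaf
# index or a pair of child bracketings.
Subtree = Tuple[float, float, object]


def merge(left: Subtree, right: Subtree) -> Subtree:
    """Assumed aggregation: attention-weighted polarity, averaged attention."""
    pl, al, sl = left
    pr, ar, sr = right
    return ((pl * al + pr * ar) / (al + ar), (al + ar) / 2, (sl, sr))


def beam_cky(leaves: List[Tuple[float, float]], gold: float,
             beam: int) -> Dict[Tuple[int, int], List[Subtree]]:
    """Fill the CKY chart bottom-up, keeping at most `beam` subtrees per cell."""
    n = len(leaves)
    chart: Dict[Tuple[int, int], List[Subtree]] = {}
    for i, (p, a) in enumerate(leaves):
        chart[(i, i)] = [(p, a, i)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            candidates = [merge(l, r)
                          for k in range(i, j)
                          for l in chart[(i, k)]
                          for r in chart[(k + 1, j)]]
            # Keep the B candidates whose polarity is closest to the gold label.
            candidates.sort(key=lambda t: abs(t[0] - gold))
            chart[(i, j)] = candidates[:beam]
    return chart


# Hypothetical (polarity, attention) pairs for a 5-EDU document.
leaves = [(-0.6, 0.9), (-0.2, 0.4), (0.7, 0.3), (-0.8, 0.8), (0.1, 0.2)]
chart = beam_cky(leaves, gold=-0.5, beam=2)
root_polarity, root_attention, bracketing = chart[(0, len(leaves) - 1)][0]
print(bracketing, round(root_polarity, 3))
```

With a beam size larger than the number of possible subtrees, the same routine degenerates to the unconstrained CKY enumeration, which makes it easy to compare both settings on short documents.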
Ideally, we would like to support a high degree of exploration on low levels of the tree, only loosely forcing the sentiment of subtrees in the beam to align with the overall document gold-label sentiment; while on higher levels of the tree, the requirement of closely reflecting the document's gold-standard sentiment should be strictly enforced (i.e., exploiting the distant supervision).
We implement this strategy through a stochastic beam-search approach, which relies on a softmax selection using the Boltzmann-Gibbs distribution (Poole and Mackworth, 2010). The temperature coefficient $\tau$ thereby modulates the exploration-exploitation trade-off (similar to previous work in RL), by influencing the divergence of the softmax outputs. We then sample from the resulting categorical probability distribution $P = (Prob(t_1), \ldots, Prob(t_n))$, computed for every local subtree $t_i \in T$, to obtain a subset $T'$ of size $B$ (as shown in equation 3).
In this work, the parameter $\tau$ is defined as a linear function $f(n, c)$ parameterized by the number of EDUs $c$ covered under the subtree $t_i$ as well as the total number of EDUs $n$ (see equation 4). This way, $\tau$ influences equation 3 such that for larger values of $c$ (at the top of the tree), $\tau$ gets close to 1 and the sampling is likely to select subtrees with low distance $|p_{t_i} - gl|$. For subtrees with a small coverage $c$ (at the bottom of the tree), $\tau$ becomes $\gg 1$ and $Prob(t_i)$ resembles the uniform distribution, allowing for a high degree of exploration. For illustration, Figure 2 highlights the differences between the standard and the stochastic beam-search approach.
Figure 2: Standard beam-search approach (left) picking the top $B = 2$ tree-candidates with the smallest distance $|p_{t_i} - gl|$ in every CKY cell. Stochastic beam-search approach (right) calculating the Boltzmann-Gibbs distribution with the tree-coverage dependent temperature $\tau$, modulating the subtree sampling process of the tree-candidates. (For readability, we only show a maximum of 4 subtrees per CKY cell.)
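The sampling step can be sketched as follows. The code is an illustration under explicit assumptions rather than the exact formulation: the softmax is taken over the negative distance to the gold label divided by the temperature, the linear temperature schedule `temperature(n, c, slope)` is a hypothetical choice that merely reproduces the described behaviour (value 1 at full coverage, larger values for small coverage), and sampling is done without replacement using the standard library.

```python
import math
import random
from typing import List, Tuple

Candidate = Tuple[float, float]  # (polarity, attention) of a subtree


def temperature(n: int, c: int, slope: float = 1.0) -> float:
    """Hypothetical linear schedule: 1 at full coverage (c == n), larger below."""
    return 1.0 + slope * (n - c)


def boltzmann(distances: List[float], tau: float) -> List[float]:
    """Boltzmann-Gibbs (softmax) probabilities; small distance -> high probability."""
    weights = [math.exp(-d / tau) for d in distances]
    z = sum(weights)
    return [w / z for w in weights]


def sample_beam(candidates: List[Candidate], gold: float, beam: int,
                tau: float, rng: random.Random) -> List[Candidate]:
    """Draw up to `beam` distinct candidates, re-normalising after each draw."""
    pool = list(candidates)
    chosen: List[Candidate] = []
    while pool and len(chosen) < beam:
        probs = boltzmann([abs(p - gold) for p, _ in pool], tau)
        idx = rng.choices(range(len(pool)), weights=probs, k=1)[0]
        chosen.append(pool.pop(idx))
    return chosen


rng = random.Random(0)
cell = [(-0.55, 0.6), (-0.1, 0.5), (0.4, 0.7), (0.9, 0.2)]
# Low in the tree (c = 2 of n = 30 EDUs): high temperature, near-uniform sampling.
explore = sample_beam(cell, gold=-0.5, beam=2, tau=temperature(30, 2), rng=rng)
# At the root (c = n): temperature 1, candidates close to the gold label are favoured.
exploit = sample_beam(cell, gold=-0.5, beam=2, tau=temperature(30, 30), rng=rng)
print(explore, exploit)
```

Swapping in a different temperature schedule only requires replacing `temperature`, which is convenient when experimenting with the exploration-exploitation trade-off.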
Analysis of Spatial Complexity. The described system significantly reduces the spatial complexity, regardless of whether a stochastic component is used. The complexity reduction can be easily observed by comparing the theoretical upper-bounds for the space consumption of the unrestricted CKY approach (eq. 5) against the upper-bounds for the heuristically constrained CKY method (eq. 6).
In both equations, $n$ represents the number of leaf-nodes (EDUs) in the discourse tree. In eq. 5, the number of generated trees at every level of the tree is bounded by the Catalan number, while in eq. 6 the bound has a quadratic dependency on the input-size and the beam-size. For the equations shown, we assume that on every level of the tree, each of the possible subtrees is represented by 2 pointers to the child-nodes as well as a sentiment and attention value for the subtree itself. Table 1 compares the space capacities required with increasing document length, indicating that with a proper beam size, our heuristic strategy can deal with the tree structures for very long documents.
Table 1: Upper-bounds for growth of spatial complexity using different beam sizes and unconstrained CKY (∞), assuming 1 Byte per unit in memory. KB = $10^3$, MB = $10^6$, GB = $10^9$, PB = $10^{15}$, SB = $10^{54}$.
Integration of Nuclearity. With this scalable solution, it is now possible to also take additional properties, like nuclearity, into account. The inherent advantage of generating nuclearity-attributed discourse trees becomes obvious when revisiting the definition in RST (Mann and Thompson, 1988), where the nuclearity-attribute encodes a notion of "importance" in the local context, with Nucleus-Satellite (N-S) and Satellite-Nucleus (S-N) attributions defining the directionality between two nodes, while the Nucleus-Nucleus (N-N) attribution implies equal importance (Morey et al., 2018). Expressing this notion of importance, it is not surprising that nuclearity-attribution is frequently critical in informing many downstream tasks like summarization and text categorization (e.g., Marcu (2000); Ji and Smith (2017); Shiv and Quirk (2019)). Technically, we integrate the nuclearity attribute into the tree-generation process by assigning each subtree one of the three nuclearity classes N-S, S-N or N-N, following the assumption that the attention values $a_{c_l}, a_{c_r}$ capture the nodes' relative importance in the tree. Starting from the leaf-node attention, extracted from MILNet, we propagate the attention values through the tree structure according to equation (1). More specifically, for a subtree where the attention value $a_{c_l}$ is greater than the attention $a_{c_r}$, we assign the N-S label, while S-N is assigned if the opposite is true. However, in this way, only two of the three possible nuclearity classes can be represented (namely N-S and S-N), as the attention values are distinct. To further account for the third class of N-N, we include an additional subtree at every merge in the CKY procedure, which averages not only the two attention values $a_{c_l}, a_{c_r}$ (as shown in eq. 1) but also the child polarity scores $p_{c_l}, p_{c_r}$.
This reflects the definition of the N-N nuclearity class according to RST, where an even importance for all child nodes is assumed in the multi-nucleus case. The additional complexity of doubling the number of trees in each cell is only manageable due to the use of our heuristic approach.
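The nuclearity assignment at a single merge can be sketched as below. This is a minimal illustration, assuming the directional candidate uses an attention-weighted polarity and an averaged attention; the function name `merge_with_nuclearity` and the example values are hypothetical, and ties in attention (which the text treats as not occurring) fall to S-N here purely by convention.

```python
from typing import List, Tuple

Child = Tuple[float, float]              # (polarity, attention)
Candidate = Tuple[float, float, str]     # (polarity, attention, nuclearity)


def merge_with_nuclearity(left: Child, right: Child) -> List[Candidate]:
    """Return the two candidate subtrees produced by one CKY merge."""
    (pl, al), (pr, ar) = left, right
    mean_attention = (al + ar) / 2
    # Directional candidate: the child with the higher attention is the nucleus.
    weighted_polarity = (pl * al + pr * ar) / (al + ar)
    direction = "N-S" if al > ar else "S-N"
    # Multi-nucleus candidate: both children count equally, so polarity is averaged too.
    mean_polarity = (pl + pr) / 2
    return [
        (weighted_polarity, mean_attention, direction),
        (mean_polarity, mean_attention, "N-N"),
    ]


pair = merge_with_nuclearity((-0.6, 0.9), (0.2, 0.3))
for polarity, attention, label in pair:
    print(label, round(polarity, 3), round(attention, 3))
```

Because both candidates enter the same cell, the beam selection decides per merge whether the directional or the multi-nucleus reading better matches the document sentiment.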

Evaluation
In this section, we evaluate our proposed method to generate the MEGA-DT discourse treebank by assessing the performance of a discourse parser when trained on MEGA-DT against our previously proposed "silver-standard" treebank (Huber and Carenini, 2019) as well as two commonly used, human-annotated, discourse corpora.

Treebanks
The two human-annotated treebanks are: Instructional-DT (from here on called Instr-DT) by Subba and Di Eugenio (2009), which comprises documents on home-repair instructions annotated with full RST-style discourse trees, separated into training- and test-set with a 90-10 split. RST-DT by Carlson et al. (2002), containing news articles alongside full RST-style discourse trees, in the standard 90-10 train-test split.
The two automatically annotated treebanks are: Yelp13-DT, generated according to our previously proposed unconstrained CKY approach as described in Huber and Carenini (2019). We use the pre-segmented version of the Yelp'13 customer review dataset by Angelidis and Lapata (2018), separated into EDUs by applying the publicly available discourse segmenter proposed in Feng and Hirst (2014). Yelp13-DT contains short documents with ≤ 20 EDUs, only considering two nuclearity classes (namely N-S and S-N). MEGA-DT, our novel treebank, is also generated from the original Yelp'13 corpus, akin to Yelp13-DT. However, due to our newly proposed, scalable solution, MEGA-DT is much larger and more comprehensive, integrating all three nuclearity classes. A comparison of the key dimensions of all treebanks used in this work is shown in Table 2.

Discourse Parsers
To interpret our results in the context of existing work, we consider a diverse set of top-performing discourse parsers. Previous work by Morey et al. (2017) compares a set of competitive parsers, including DPLP (Ji and Eisenstein, 2014), gCRF (Feng and Hirst, 2014), CODRA (Joty et al., 2015) and Li et al. (2016). We further add the Two-Stage discourse parser by Wang et al. (2017) and the neural approach by Yu et al. (2018) into our final evaluation. Due to the top performance of the parser by Wang et al. (2017) on the structure-prediction of the widely used RST-DT corpus, and even more importantly, due to the separation of the relation computation from the structure/nuclearity prediction, we use the parser by Wang et al. (2017) in our inter-domain experiments.

Preliminary Evaluation
We run a set of preliminary evaluations on a randomly selected subset containing 10,000 documents from the Yelp'13 dataset. In general, the preliminary evaluation suggests that: (1) A beam size of 10 delivers the best trade-off between computational complexity and performance (out of {1, 5, 10, 50, 100}), when tested according to the distance between gold-label sentiment and model prediction. (2) A sentence-first aggregation strategy, using sentence-boundary predictions from the NLTK toolkit, is beneficial. By not allowing inter-sentence connections, unless the complete sentence is already represented by a subtree, we reach superior results in the preliminary evaluation compared to exploring the complete CKY space. This is consistent with previous findings showing that sentence boundaries are key signals for tree aggregations (Joty et al., 2015).
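The sentence-first constraint can be expressed as a simple check on CKY spans. The sketch below is one possible reading of the rule and not the released implementation: a span is allowed if it stays inside one sentence, or if it starts at a sentence start and ends at a sentence end. The function name `span_allowed` and the `sentence_of` mapping (sentence index per EDU, which in practice would come from a sentence splitter) are hypothetical.

```python
from typing import List


def span_allowed(i: int, j: int, sentence_of: List[int]) -> bool:
    """Check whether the EDU span [i, j] respects sentence-first aggregation.

    sentence_of[k] is the index of the sentence containing EDU k.
    """
    if sentence_of[i] == sentence_of[j]:
        return True  # the span stays inside a single sentence
    starts_sentence = i == 0 or sentence_of[i - 1] != sentence_of[i]
    ends_sentence = j == len(sentence_of) - 1 or sentence_of[j + 1] != sentence_of[j]
    # Spans crossing a boundary must be built from complete sentences only.
    return starts_sentence and ends_sentence


# Two sentences: EDUs 0-1 form the first, EDUs 2-4 the second.
sentence_of = [0, 0, 1, 1, 1]
allowed = [(i, j) for i in range(5) for j in range(i, 5) if span_allowed(i, j, sentence_of)]
print(allowed)
```

If cells for disallowed spans are simply left empty, the usual CKY split loop never combines a partial sentence with material from a neighbouring sentence.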

Experiments and Results
We train the discourse parser by Wang et al. (2017) on the MEGA-DT, Yelp13-DT, RST-DT and Instr-DT corpora⁴. To verify the ability of the training treebanks to support the discourse parser in extracting domain-independent features of general discourse, we evaluate the performance on the inter-domain discourse parsing task, training the Two-Stage discourse parser on one domain (e.g., Yelp user reviews in MEGA-DT) and evaluating it on documents in a different domain (e.g., news articles in RST-DT). We compare the obtained performances against the classic and arguably easier intra-domain measure (training and testing on documents within the same domain). The results of the final evaluation are summarized and aggregated in three sets of experiments in Table 3. In the first set (on top of Table 3), we show the micro-averaged original Parseval performance (Par.) (Morey et al., 2017) as well as the RST-Parseval measures (R-Par.) of standard linguistic baselines for the structure- and nuclearity-prediction task. Regarding the structure prediction (left), we compare the performance when applying a strictly right- or left-branching tree to the data, as well as hierarchical versions of those (right-/left-branching trees on sentence-level combined by right-/left-branching trees on document level). The results indicate that the hierarchical right-branching tree resembles the original tree structure the closest on both metrics and either evaluation treebank⁵. As a baseline for the nuclearity prediction task, we compute the majority class on the training corpora. It is important to note that while the linguistic baselines for structure do not require available training data, the majority class measure depends on access to an annotated corpus in the target domain.
⁴ Trained on an Intel Core i9 (10 Cores, 3.30 GHz) CPU.
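The structural baselines can be illustrated with a small sketch that builds right-branching and hierarchical right-branching bracketings and scores them by span overlap. This is a simplified stand-in for the actual metrics: it only compares internal-node spans, in the spirit of original Parseval, and ignores nuclearity, relations and the leaf spans additionally counted by RST-Parseval. All function names and the toy document are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive EDU index range


def right_branching_spans(units: List[Span]) -> List[Span]:
    """Internal-node spans of a right-branching tree over consecutive units."""
    return [(units[k][0], units[-1][1]) for k in range(len(units) - 1)]


def hierarchical_right_branching(sentences: List[Span]) -> List[Span]:
    """Right-branching inside each sentence, then right-branching across sentences."""
    spans: List[Span] = []
    for start, end in sentences:
        spans += right_branching_spans([(k, k) for k in range(start, end + 1)])
    spans += right_branching_spans(sentences)
    return spans


def span_score(predicted: List[Span], gold: List[Span]) -> float:
    """Fraction of gold internal-node spans that also occur in the prediction."""
    return len(set(predicted) & set(gold)) / len(set(gold))


# Toy document: 5 EDUs in two sentences (EDUs 0-1 and EDUs 2-4).
sentences = [(0, 1), (2, 4)]
flat = right_branching_spans([(k, k) for k in range(5)])
hier = hierarchical_right_branching(sentences)
print(flat, hier, span_score(flat, hier))
```

For binary trees over the same EDUs, predicted and gold trees have the same number of internal nodes, so this overlap fraction coincides with both span precision and span recall.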
The second set of results shows the intra-domain performance of top performing discourse parsers, frequently evaluated against in the past. While all parsers except CODRA (Joty et al., 2015) have been only evaluated on RST-DT, we additionally train and evaluate the Two-Stage parser on the Instr-DT corpus. When comparing the intra-domain discourse parsing performance, the Two-Stage parser reaches the consistently best performance on RST-DT structure prediction, while the discourse parser by Yu et al. (2018) achieves the best results on the RST-DT nuclearity prediction using RST-Parseval. CODRA reaches the best performance on the Instr-DT corpus when evaluated with RST-Parseval.
⁵ Note that the performance of the Hierarchical Right-Branching baseline is higher than reported in Huber and Carenini (2019), because of an additional clean-up step required during data preprocessing. The competitive performance of this baseline is most likely attributed to the highly structured nature of the target domains.
Each sample is generated as the average performance over 10 random subsets, drawn from 3 independently created treebanks.
The main contribution of this work is placed in the third set of results, where the Two-Stage discourse parser is trained and tested on different, non-overlapping domains (i.e., inter-domain). This task is arguably more useful and significantly more difficult than the task evaluated in the second set, which is reflected in the performance decrease for structure and nuclearity in the first two rows of the sub-table, confirming that the transfer of discourse-structures and -nuclearity between domains is a challenging task. The results presented in the third row of the sub-table show the performance of the Two-Stage parser when trained on Yelp13-DT, containing short documents with limited nuclearity annotations. The approach achieves consistently better performance compared to the first two rows on the inter-domain structure prediction task (for both original Parseval and RST-Parseval), as we have previously shown in Huber and Carenini (2019). However, only considering two out of three nuclearity classes (N-S and S-N), the system performs rather poorly on the nuclearity classification task. The bottom row of the third sub-table displays the performance of the Two-Stage discourse parser when trained on our new MEGA-DT corpus. Training on MEGA-DT delivers statistically significant improvements over the best inter-domain baseline in all structure prediction tasks. Furthermore, our new system also achieves statistically significant gains on the Instr-DT nuclearity prediction, when evaluated according to the RST-Parseval metric. The nuclearity measure on RST-DT using RST-Parseval is statistically equivalent to the best baseline system. Overall, our MEGA-DT corpus appears to outperform previously published treebanks for inter-domain discourse parsing on every sub-task on at least one competitive metric.
In order to gain deeper insights into the effectiveness of our proposed treebank generation approach, we run a set of four additional evaluations. First, we evaluate the individual components of our system by showing an ablation study in Table 4, starting with the performance of the discourse parser trained with MEGA-DT-Base, a treebank generated with the standard beam-search approach and without integrating nuclearity. Adding each feature separately (+Stoch, +Nuc) we observe improvements on at least one of the sub-tasks; however, the combination of the two components produces the best performing MEGA-DT corpus. Second, we show the performance-trend over increasingly large subsets of MEGA-DT in Figure 3, tested on RST-DT (top) and Instr-DT (bottom). The two trends highlight consistent improvements with increasingly large dataset sizes, suggesting further possible gains with even larger treebanks. Third, we further analyze the nuclearity classification performance in Table 5, which presents four confusion matrices for the discourse parsing output of our MEGA-DT treebank, evaluated according to the original Parseval and RST-Parseval metrics on RST-DT and Instr-DT. The matrices show a potential explanation for the performance-gap between the original Parseval and the RST-Parseval metrics, identifying the over-prediction of the N-N class, especially for gold-label N-S nuclearities. Further, we frequently misclassify the gold-label N-S nuclearity class as S-N. Lastly, we present an additional qualitative analysis in Appendix A to investigate the strengths and potential weaknesses of trees in MEGA-DT. To this end, we show three randomly selected trees that closely/poorly reflect the authors' gold-label sentiment, respectively (see Figure 4 for a teaser). In general, the qualitative analysis shows that trees in MEGA-DT are non-trivial, reasonably balanced, strongly linked to the EDU-level sentiment and mostly well-aligned with meaningful discourse-structures.

Conclusions and Future Work
In this work, we present a novel distant supervision approach to predict the discourse-structure and -nuclearity for documents of arbitrary length solely using document-level sentiment information. To deal with the increasing spatial complexity, we apply and compare heuristic beam-search strategies, including a stochastic variant inspired by RL techniques. Our results on the challenging inter-domain discourse-structure and -nuclearity prediction task strongly suggest that the heuristic approach taken (1) enhances the structure prediction task through more diversity in the early-stage tree selection, (2) allows us to effectively predict nuclearity and (3) helps to significantly reduce the complexity of the unrestricted CKY approach to scale for arbitrary length documents.
In conclusion, our new approach allows the NLP community to augment any existing sentiment-annotated dataset with discourse trees, enabling the automated generation of large-scale domain/genre-specific discourse treebanks. As a case study for the effectiveness of the approach, we annotate and publish our MEGA-DT corpus as a high quality RST-style discourse treebank, which has been shown to outperform previously proposed discourse treebanks (namely Yelp13-DT, RST-DT and Instr-DT) on most tasks of inter-domain discourse parsing. This suggests that parsers trained on our MEGA-DT corpus (or further domain-specific treebanks generated according to our approach) should be used to derive discourse trees in target domains where no gold-labeled data is available.
This work can be extended in several ways: (i) We plan to investigate further functions for τ to enhance the exploration-exploitation trade-off. (ii) Additional strategies to assign nuclearity should be explored, considering the excessive N-N classification shown in our evaluation. (iii) We plan to apply our approach to more sentiment datasets (e.g., Diao et al. (2014)), creating even larger treebanks. (iv) Our new and scalable solution can be extended to also predict discourse relations besides structure and nuclearity. (v) We also plan to use a neural discourse parser (e.g. Yu et al. (2018)) in combination with our large-scale treebank to fully leverage the potential of data-driven discourse parsing approaches. (vi) Taking advantage of the new MEGA-DT corpus, we want to revisit the potential of discourse-guided sentiment analysis, to enhance current systems, especially for long documents. (vii) Finally, more long term, we intend to explore other auxiliary tasks for distant supervision of discourse, like summarization, question answering and machine translation, for which plenty of annotated data exists (e.g., Nallapati et al. (2016); Cohan et al. (2018); Rajpurkar et al. (2016, 2018)).

References
James D Keeler, David E Rumelhart, and Wee Kheng Leow. 1991. Integrated segmentation and recognition of hand-printed numerals. In Advances in neural information processing systems, pages 557-563.

A Qualitative Analysis of Generated Discourse Trees
The following examples are automatically generated trees from our MEGA-DT corpus. EDU leaf-nodes are enumerated and can be referenced with the discourse units in the description. The colour-saturation and -hue values represent the sentiment of the nodes, with a dark red (high saturation) representing a strongly negative subtree, white (low saturation) representing a neutral sentiment subtree and a dark green (high saturation) representing a strongly positive subtree.
The thickness of edges and the size of nodes represent the attention of the subtree, which is strongly correlated with the subtree nuclearity.