Unleashing the Power of Neural Discourse Parsers - A Context and Structure Aware Approach Using Large Scale Pretraining

RST-based discourse parsing is an important NLP task with numerous downstream applications, such as summarization, machine translation and opinion mining. In this paper, we demonstrate a simple, yet highly accurate discourse parser, incorporating recent contextual language models. Our parser establishes the new state-of-the-art (SOTA) performance for predicting structure and nuclearity on two key RST datasets, RST-DT and Instr-DT. We further demonstrate that pretraining our parser on the recently available large-scale “silver-standard” discourse treebank MEGA-DT provides even larger performance benefits, suggesting a novel and promising research direction in the field of discourse analysis.


Introduction
Discourse parsing is an important upstream task within the area of Natural Language Processing (NLP) which has been an active field of research over the last decades. In this work, we focus on discourse representations for the English language, where most research has been surrounding one of the two main theories behind discourse, the Rhetorical Structure Theory (RST) proposed by Mann and Thompson (1988) or interpreting discourse according to PDTB (Prasad et al., 2008). While both theories have their strengths, the application of the RST theory, encoding documents into complete constituency discourse trees (Morey et al., 2018), has been shown to have many crucial implications on real world problems. A tree is defined on a set of EDUs (Elementary Discourse Units), approximately aligning with clause-like sentence fragments, acting as the leaves of the tree. Adjacent EDUs or sub-trees are hierarchically aggregated to form larger (possibly non-binary) constituents, with internal nodes containing (1) a nuclearity label, defining the importance of the subtree (rooted at the internal node) in the local context and (2) a relation label, defining the type of semantic connection between the two subtrees (e.g., Elaboration, Background). In this work, we focus on structure and nuclearity prediction, not taking relations into account. Previous research has shown that the use of RST-style discourse parsing as a system component can enhance important tasks, such as sentiment analysis, summarization and text categorization (Bhatia et al., 2015; Nejat et al., 2017; Hogenboom et al., 2015; Gerani et al., 2014; Ji and Smith, 2017). More recently, it has also been suggested that discourse structures obtained in an RST-style manner can further be complementary to learned contextual embeddings, like the popular BERT approach (Devlin et al., 2018).
Combining both approaches has been shown to support tasks where linguistic information on complete documents is critical, such as argumentation analysis (Chakrabarty et al., 2019). Even though discourse parsers appear to enhance the performance on a variety of tasks, the full potential of using more linguistically inspired approaches for downstream applications has not been unleashed yet. The main open challenges in integrating discourse into more NLP downstream tasks and delivering even greater benefits have been (1) discourse parsing being a difficult task itself, with an inherently high degree of ambiguity and uncertainty and (2) the lack of large-scale annotated datasets, rendering the initial problem more severe, as data-driven approaches cannot be applied to their full potential.
The combination of these two limitations has been one of the main reasons for the limited application of neural discourse parsing to more diverse downstream tasks. While neural discourse parsers have been proposed (Braud et al., 2017; Yu et al., 2018; Mabona et al., 2019), they still cannot consistently outperform traditional approaches when applied to the RST-DT dataset, where the amount of training data is arguably insufficient for such data-intensive approaches.
In this work, we alleviate the restrictions to the effective and efficient use of discourse as mentioned above by introducing a novel approach combining a newly proposed large-scale discourse treebank with our data-driven neural discourse parsing strategy. More specifically, we employ the novel MEGA-DT "silver-standard" discourse treebank published by Huber and Carenini (2020) containing over 250,000 discourse annotated documents from the Yelp'13 sentiment dataset (Tang et al., 2015), nearly three orders of magnitude larger than commonly used RST-style annotated discourse treebanks (RST-DT (Carlson et al., 2002), Instructional-DT (Subba and Di Eugenio, 2009)). Given this new dataset with a previously unseen number of full RST-style discourse trees, we revisit the task of neural discourse parsing, which has been previously attempted by Yu et al. (2018) and others with rather limited success. We believe that one reason why previous neural models could not yet consistently outperform more traditional approaches, heavily relying on feature engineering (Wang et al., 2017), is the lack of generalisation when using deep learning approaches on the small RST-DT dataset, containing only 385 discourse annotated documents. This makes us believe that using a more advanced neural discourse parser in combination with a large training dataset can lead to significant performance gains. Admittedly, even though MEGA-DT contains a huge number of datapoints to train on, it has been automatically annotated, potentially introducing noise and biases, which can negatively influence the performance of our newly proposed neural discourse parser when solely trained on this dataset. A natural and intuitive approach to make use of the neural discourse parser and both datasets ("silver-standard" and gold-standard) is to combine them during training, pretraining on the large-scale "silver-standard" corpus and subsequently fine-tuning on RST-DT or further human annotated datasets. 
This way, general discourse structures could be learned from the large-scale treebank and then enhanced with human-annotated trees. With the results shown in this paper strongly suggesting that our new discourse parser can encode discourse more effectively, we hope that our efforts will prompt researchers to develop more linguistically inspired applications based on our discourse parser. Our contributions in this paper are: (1) We propose a novel neural discourse parsing architecture which combines multiple lines of previous work in a single framework.
(2) We combine a large-scale "silver-standard" treebank (MEGA-DT) with small, domain-specific gold-standard treebanks in a neural way during the training process, by initially pretraining on the large (domain-independent) dataset and subsequently fine-tuning on the dataset within the domain itself.
(3) We apply the neural discourse parser to two commonly used discourse treebanks (RST-DT and Instructional-DT), showing large performance improvements of our model over previous state-of-the-art approaches.

Related Work
The field of discourse parsing has been mainly dominated by traditional machine learning approaches, frequently outperforming initial attempts to apply deep learning and neural networks to the task. Independent of the specific approach used, three general methodologies have been followed to learn discourse trees from small datasets, such as RST-DT (Carlson et al., 2002) or Instructional-DT (Subba and Di Eugenio, 2009): (1) Top-down discourse parsers, splitting the document into non-overlapping text constituents starting from the representation of the complete discourse down to individual EDUs, assigning the two resulting sub-spans a nuclearity attribute and predicting the relation holding between the sub-trees. (2) Bottom-up parsing, starting from the discourse-segmented list of EDUs and aggregating two adjacent units in every step. This approach is mostly realized using the CKY dynamic programming strategy to obtain optimal trees as in Joty et al. (2015) and Li et al. (2016) or using a greedy method (Hernault et al., 2010). (3) A frequently used and more locally inspired approach of bottom-up discourse parsing using the linear shift-reduce framework, adopted from previous work in syntactic parsing. While the current traditional state-of-the-art discourse parser by Wang et al. (2017) uses the bottom-up shift-reduce method with actions predicted by two separate Support Vector Machines (SVMs) for structure/nuclearity prediction and relation estimation, neural models (Yu et al., 2018; Braud et al., 2017) utilize multi-layer perceptrons (MLPs) for classifying possible actions. In our work we also follow the bottom-up shift-reduce strategy, with the detailed description of our system provided in the following section.
Besides the active research area on discourse parsing, a second line of work has emerged recently, trying to generate large-scale discourse treebanks through automated annotations from downstream tasks, such as sentiment analysis (Huber and Carenini, 2020), text classification (Liu and Lapata, 2018), summarization (Liu et al., 2019a) and fake news detection (Karimi and Tang, 2019). The majority of these approaches follow the intuition that discourse trees can be inferred from downstream tasks by predicting latent representations during the learning process of the task itself (Liu and Lapata, 2018; Karimi and Tang, 2019; Liu et al., 2019a) in an end-to-end manner. However, recent work by Ferracane et al. (2019) has shown that the trees resulting from this approach are not only poorly aligned with general discourse structures, but furthermore are oftentimes too shallow. A rather different strategy has been employed by Huber and Carenini (2020), trying to explicitly generate a discourse augmented treebank through distant supervision from sentiment annotations in combination with Multiple-Instance Learning (MIL) and a CKY bottom-up tree generation approach. While the resulting MEGA-DT dataset has only been proposed and released recently, the authors show promising results in their work, reaching the best inter-domain performance when comparing their dataset against RST-DT and the Instructional-DT. This leads us to believe that their treebank not only encodes sentiment-related information, but can also be used to infer general discourse structures on a large scale. We will further evaluate this in section 4.7.

Neural Discourse Parsing
We follow the well-established bottom-up shift-reduce aggregation principle, previously shown to be effective for traditional discourse parsers (Ji and Eisenstein, 2014; Wang et al., 2017) as well as neural approaches (Braud et al., 2017; Yu et al., 2018). In this section, we will first introduce the general principle of shift-reduce parsing and define the necessary data structures and actions available to our system. Based on the general description, we will subsequently describe the approach taken to execute a single step in the linear-time model.

General Shift-Reduce Architecture
The transition-based shift-reduce parsing architecture traditionally consists of two data structures (a queue and a stack), interacting through a set of possible actions categorized into shift and reduce actions. This architecture is illustrated in Figure 1(a, b).
The Queue initially contains the EDUs of the complete document in the natural, sequential order, obtained either from manual annotation or off-the-shelf discourse segmenters (e.g., Li et al., 2018). Depending on the action performed by the parser, the top element on the queue is either read or moved to the stack. At the end of the parsing process, the queue must be empty.
The Stack represents the previously processed part of the document (also in natural order). At the beginning of the process, the stack is empty and is subsequently filled and aggregated according to the system's actions. After the parsing process is completed, the stack contains the complete, aggregated document as a single discourse tree.
The Shift operation delays aggregations of sub-trees at the beginning of the document by popping the top input node (EDU) off the queue and pushing it onto the stack. The shift-reduce algorithm needs to set hard constraints to only allow shift operations if the queue still contains unprocessed nodes.
The Reduce-X operation is used to aggregate the top two partial trees ($S_1$, $S_2$) on the stack into a single representation ($S_{1,2}$). For complete RST-style discourse trees, each reduce action needs to further define a nuclearity assignment $X_N \in \{NN, NS, SN\}$ for the sub-tree covering $S_{1,2}$ and a relation $X_R \in \{\text{Elaboration}, \text{Contrast}, \ldots\}$ holding between them. In this work, we limit the scope of the reduce action to solely predicting the nuclearity assignment $X_N$, as the MEGA-DT treebank currently only provides partial discourse trees, not incorporating the relation assignment.
While the specific implementation of the stack and queue components is mostly fixed, the selection of shift and reduce actions can be realized with rule-based approaches (Marcu, 2000) or a variety of machine learning models, such as Support Vector Machines (SVMs) (Ji and Eisenstein, 2014) or neural classifiers (Yu et al., 2018). The shift-reduce action selection classifier used in this work is explained below.
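The interaction of queue, stack and actions described above can be sketched in a few lines of Python. This is a minimal illustration, not the parser's actual implementation: the function names (`legal_actions`, `parse`, `right_branching`), the tuple encoding of trees and the toy policy are all assumptions made for the example, and the policy stands in for the learned action classifier.

```python
from collections import deque

ACTIONS = ("shift", "reduce_NN", "reduce_NS", "reduce_SN")


def legal_actions(stack, queue):
    """Hard constraints: shift needs a non-empty queue, reduce needs two stack items."""
    legal = []
    if queue:
        legal.append("shift")
    if len(stack) >= 2:
        legal.extend(a for a in ACTIONS if a.startswith("reduce"))
    return legal


def parse(edus, choose_action):
    """Run shift-reduce parsing; trees are nested (nuclearity, left, right) tuples."""
    if not edus:
        raise ValueError("expected at least one EDU")
    queue = deque(edus)  # EDUs in document order
    stack = []           # partial trees, most recent on top
    while queue or len(stack) > 1:
        action = choose_action(stack, queue, legal_actions(stack, queue))
        if action == "shift":
            stack.append(queue.popleft())
        else:
            right = stack.pop()  # S_1: top of the stack
            left = stack.pop()   # S_2: the sub-tree to its left
            stack.append((action.split("_")[1], left, right))
    return stack[0]


def right_branching(stack, queue, legal):
    """Toy policy (hypothetical): shift while possible, then reduce with NS."""
    return "shift" if "shift" in legal else "reduce_NS"


tree = parse(["e1", "e2", "e3"], right_branching)
```

A document with n EDUs always takes n shift and n - 1 reduce actions, which is why the procedure runs in linear time once the action choice is fixed.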

Shift-Reduce Action Classifier
In this work, we take advantage of the recent success of BERT-inspired models on the task of language modeling, employing the distilled version of the large-scale RoBERTa (Liu et al., 2019b) language model as the base for our action classifier. At each step in the shift-reduce framework, we predict the next action by encoding the top two elements of the Stack ($S_1$ and $S_2$) and the top element of the Queue ($Q_1$) (Wang et al., 2017) using the RoBERTa model as well as additional structural features.
RoBERTa-based semantic features: To align with the input format requirements of the RoBERTa model, we first encode the joint representation of the three components under consideration (namely $S_1$, $S_2$ and $Q_1$) into a single string by concatenation, following the standard RoBERTa syntax, with [CLS] and [SEP] representing the sequence-classification and end-of-sequence tokens respectively. Please note that $S_1$, $S_2$ and $Q_1$ are not the typically used dense representations of sub-trees or EDUs, but solely contain the textual representation of a sub-tree or EDU as a flat sequence of words.
Since the input sequence length of the RoBERTa language model is by default bounded by 512 tokens and the elements on the stack ($S_1$, $S_2$) represent increasingly large sub-trees (and therefore text spans) with every additional reduce operation executed, we restrict the length of $S_1$ and $S_2$ to a maximum of 240 words each. Specifically, if one of the constituents exceeds 240 tokens, the concatenation of its 120 leading and 120 trailing words is chosen as the span's representation. The decision to retain the leading and trailing parts of a span comes from the observation that those parts often contain important cue words which signal explicit discourse relations (Prasad et al., 2008). Furthermore, if the length of $S_1$ or $S_2$ is below the maximum number of 240 tokens, it is padded with [MASK] tokens to preserve absolute positions in the RoBERTa input, which is important since RoBERTa uses absolute positional embeddings. The top element on the queue ($Q_1$) is truncated or extended to the leftover capacity of 28 tokens in the same way, which clearly suffices for Queue elements, which only contain single EDUs of 13 words on average for the RST-DT corpus and 7 words for MEGA-DT. Hence, the remaining space of 480 tokens is split equally among the stack elements, as described above. The embedding $c$ for the current stack and queue configuration is taken from the RoBERTa output $v$ for this input string, with $c = v[0] \in \mathbb{R}^{768}$ encoding the full span representation at the index of the [CLS] token.
Structural features: Previous successful approaches to traditional discourse parsing (Joty et al., 2015; Ji and Eisenstein, 2014) have shown that the structural organization of a document into sentences and paragraphs plays a crucial role when predicting discourse, with Joty et al. (2015) giving strong intuition for their usefulness by showing that less than 5% of discourse subtrees violate sentence boundaries in the RST-DT corpus. To explicitly model these structural features, we use the same organizational features as Wang et al. (2017) to determine where a span is positioned within the document as well as relative to adjacent constituents. More specifically, for all three spans $S_1$, $S_2$ and $Q_1$, we extract two sets of features: (1) whether the span is at the beginning/end of a sentence/paragraph/document and (2) for each pair of adjacent spans (($S_2$, $S_1$) and ($S_1$, $Q_1$)), whether the pair is within a single sentence/paragraph. For the three constituents under consideration, this results in an ordered sequence $o$ of 28 values. Adding a distinct embedding layer with 10 neurons, $u_i = \mathrm{emb}(o_i)$, on top of each value $o_i$ in the sequence results in a concatenated dense representation $u = [u_1, \ldots, u_{28}] \in \mathbb{R}^{280}$ for the structural features. Whenever a feature cannot be computed (for example when the Queue is empty or the Stack contains a single element), we represent it as a zero-vector.
Action classification: To predict the next shift-reduce action during the tree-generation process, the semantic and structural features described above are concatenated (see Figure 1(c)) and fed into a two-layer MLP with an intermediate GELU (Hendrycks and Gimpel, 2016) activation and a final softmax layer (see eq. 2, 3) over the four possible actions (Shift, Reduce$_{NN}$, Reduce$_{NS}$, Reduce$_{SN}$).
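The truncation and padding rule for the three text spans can be illustrated with a small sketch. It makes several simplifying assumptions that are not taken from the paper: spans are lists of whitespace-separated words rather than RoBERTa subword tokens, padding is appended on the right, and the helper name `fit_span` is hypothetical.

```python
MAX_STACK_LEN = 240  # budget for each of the two stack elements
MAX_QUEUE_LEN = 28   # budget for the top queue element
PAD = "[MASK]"       # padding symbol named in the text


def fit_span(words, max_len, pad=PAD):
    """Return exactly max_len items: keep head and tail if too long, pad if too short."""
    if len(words) > max_len:
        half = max_len // 2
        return words[:half] + words[-half:]
    # Right-side padding is an assumption; the text only states that spans are padded.
    return words + [pad] * (max_len - len(words))


long_span = [f"w{i}" for i in range(300)]
fitted_long = fit_span(long_span, MAX_STACK_LEN)
fitted_short = fit_span(["a", "short", "EDU"], MAX_QUEUE_LEN)
```

Because every span is forced to a fixed length, each of the three components always starts at the same position in the input, which is the property the padding is meant to preserve.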
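The two kinds of organizational features can be sketched as follows. This is an illustration under stated assumptions only: spans are given as inclusive (start, end) EDU index pairs, each EDU carries a sentence and paragraph index, and the function and key names are hypothetical. The sketch does not reproduce the exact 28-value feature sequence, whose precise composition is not spelled out in the text.

```python
def boundary_features(span, sent_ids, para_ids):
    """Flags for whether an EDU span starts/ends a sentence, paragraph or document."""
    start, end = span
    last = len(sent_ids) - 1

    def starts(unit_ids):
        return start == 0 or unit_ids[start - 1] != unit_ids[start]

    def ends(unit_ids):
        return end == last or unit_ids[end + 1] != unit_ids[end]

    return {
        "starts_sentence": starts(sent_ids),
        "ends_sentence": ends(sent_ids),
        "starts_paragraph": starts(para_ids),
        "ends_paragraph": ends(para_ids),
        "starts_document": start == 0,
        "ends_document": end == last,
    }


def pair_features(left, right, sent_ids, para_ids):
    """Flags for whether two adjacent spans lie within one sentence/paragraph."""
    first, final = left[0], right[1]
    return {
        "same_sentence": sent_ids[first] == sent_ids[final],
        "same_paragraph": para_ids[first] == para_ids[final],
    }


# Five EDUs: sentence index and paragraph index per EDU (toy document).
sent_ids = [0, 0, 1, 1, 2]
para_ids = [0, 0, 0, 1, 1]
feats = boundary_features((0, 1), sent_ids, para_ids)
```

In the parser itself, each such value would then be passed through its own 10-dimensional embedding before concatenation.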
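One way to write the classifier that is consistent with the description above is given below. The parameter names $W_1$, $b_1$, $W_2$, $b_2$ and the hidden size $d_h$ are notational assumptions introduced for this sketch, not symbols taken from the paper; $[c; u]$ denotes the concatenation of the semantic embedding $c \in \mathbb{R}^{768}$ and the structural embedding $u \in \mathbb{R}^{280}$.

```latex
\begin{align}
l &= W_2 \, \mathrm{GELU}\big(W_1 [c; u] + b_1\big) + b_2,
  \qquad W_1 \in \mathbb{R}^{d_h \times 1048},\; W_2 \in \mathbb{R}^{4 \times d_h}, \\
p &= \mathrm{softmax}(l) \in \mathbb{R}^{4}.
\end{align}
```

Here $l$ plays the role of the unnormalized prediction and $p$ the probability distribution over the four actions.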

Neural Shift-Reduce Training Procedure
We minimize the weighted cross-entropy loss in every parsing step individually, allowing for (1) more fine-grained optimization and (2) more parallelizable training. The training loss is thereby computed between the unnormalized prediction $l$ (see eq. 2) and the respective gold label $y \in \{\text{Shift}, \text{Reduce}_{NN}, \text{Reduce}_{NS}, \text{Reduce}_{SN}\}$. Since there is only a single shift but three reduce actions, we weight the four output classes by factors $[\frac{3}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}]$ to equally penalize an incorrect shift/reduce action. At test time, the document tree structure is constructed greedily by selecting the action with the highest probability (see eq. 3) at each parsing step.
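The per-step loss and the greedy decision rule can be sketched in plain Python. The sketch assumes that the class weight of the gold label scales the negative log-likelihood of a single step, which is one common convention for weighted cross-entropy; the function names are hypothetical.

```python
import math

ACTIONS = ("shift", "reduce_NN", "reduce_NS", "reduce_SN")
CLASS_WEIGHTS = (3 / 6, 1 / 6, 1 / 6, 1 / 6)  # weights stated in the text


def softmax(logits):
    """Numerically stable softmax over a list of unnormalized scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(logits, gold_index, weights=CLASS_WEIGHTS):
    """Loss of one parsing step: gold-class weight times negative log-probability."""
    return -weights[gold_index] * math.log(softmax(logits)[gold_index])


def greedy_action(logits):
    """Test-time rule: pick the action with the highest score."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return ACTIONS[best]


uniform = [0.0, 0.0, 0.0, 0.0]
shift_loss = weighted_cross_entropy(uniform, 0)
reduce_loss = weighted_cross_entropy(uniform, 1)
```

With uniform scores, a missed shift costs three times as much as any single missed reduce action, so the shift class and the three reduce classes together carry equal total weight.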

Experiments
We first introduce the three datasets we used to train and evaluate our model against strong baselines (section 4.1). Further, we describe how to effectively combine diverse treebanks in a neural manner using pretraining and fine-tuning, now possible with our neural discourse parser (section 4.2). Our model hyperparameters and the respective search spaces on the development set are presented in section 4.3, followed by the baseline models in section 4.4. The experiments section will then introduce the metrics used in this paper (section 4.5), discuss insights gathered in preliminary evaluations (section 4.6) and finally present results aggregated in section 4.7.

Datasets
This work relies on three distinct RST-style discourse treebanks for the English language.
RST-DT is the largest human-annotated RST-style discourse parsing corpus (Carlson et al., 2002), consisting of news articles from the Wall Street Journal. The treebank contains 347 documents in the training set and 38 in the test set. We further split the training portion into 90% training data and 10% development data to perform hyper-parameter and architecture optimization.
Instructional Dataset is the second human-annotated RST-style corpus used in this work, containing 176 documents. The Instructional Dataset (short: Instr-DT) is used to train and evaluate discourse parsers in the home-repair instructions domain (Subba and Di Eugenio, 2009). During preprocessing, we combine multi-rooted documents using a sequence of right-branching decisions with an N-N nuclearity assignment. We randomly separate the data into a 90% training portion and a 10% test portion. Please note that our training/test split is consistent across all models except the CODRA model described below, where we report the original results published by the authors using 10-fold validation.
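The preprocessing step for multi-rooted documents can be illustrated with a short sketch. The tuple encoding (nuclearity, left, right) and the function name `combine_roots` are assumptions made for this example.

```python
def combine_roots(roots):
    """Fold a list of root sub-trees into one tree using right-branching N-N nodes."""
    if not roots:
        raise ValueError("expected at least one root")
    tree = roots[-1]
    # Attach the remaining roots from right to left, so the tree branches to the right.
    for left in reversed(roots[:-1]):
        tree = ("NN", left, tree)
    return tree


combined = combine_roots(["A", "B", "C"])
```

For three roots A, B and C, this yields a tree in which A is joined to the combination of B and C, with both joins labeled N-N.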
MEGA-DT is the first successful automatically generated "silver-standard" discourse treebank obtained by applying distant supervision on the large-scale Yelp'13 sentiment dataset (Tang et al., 2015). Recently published by Huber and Carenini (2020), the treebank contains ≈250,000 documents with full RST-style discourse trees encompassing structure and nuclearity attributes. The MEGA-DT corpus has been shown to achieve superior performance when compared to human-annotated datasets (including RST-DT) on the discourse domain-transfer task. Due to computational limitations we pretrain our model on a 52.5k subset of MEGA-DT, with 50k trees used for training and 2.5k datapoints left out for the development set.

Combining Treebanks
In a first stage of our experiments, we will verify the effectiveness of our proposed architecture by training and evaluating the parser on individual datasets. Additionally, we will pretrain the parser on MEGA-DT until it converges on the development portion and then fine-tune on RST-DT and Instr-DT to evaluate the usefulness of pretraining on silver-standard discourse trees.

Hyperparameters and Training Setup
The hyperparameters in our model are heavily influenced by previous findings. For the RoBERTa model (Liu et al., 2019b), we use the pre-trained distilled version with 6 layers, each containing 12 attention heads, and a hidden size of 768. The structural features used as inputs for the classification module are encoded as 10-dimensional embeddings for each of the 28 organizational features. During training we use the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of 0.001 and a weight decay value of 0.01 for both pretraining and fine-tuning. We further apply gradient norm clipping at 0.2 (Pascanu et al., 2013). The learning rate was scheduled as in Vaswani et al. (2017), using 4000 warm-up steps. Because the trees in the training data vary in size, we aggregate documents with an identical number of EDUs into batches of size 20 during pretraining and 5 for fine-tuning. All model configurations are trained with early stopping, which is triggered if neither the structure nor the nuclearity performance improves over 3 consecutive epochs on the development dataset. Our models are trained using PyTorch (Paszke et al., 2019) on a GTX 1080 Ti GPU with 11GB of memory. Our code and model checkpoints will be made publicly available with the publication of this paper.
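The batching scheme, grouping documents with the same number of EDUs, can be sketched as follows. Documents are represented here simply as lists of EDUs, and the function name `batches_by_edu_count` is hypothetical; shuffling and any handling of leftover partial batches are left out and would be separate design decisions.

```python
from collections import defaultdict


def batches_by_edu_count(documents, batch_size):
    """Group documents by EDU count, then cut each group into batches."""
    buckets = defaultdict(list)
    for doc in documents:
        buckets[len(doc)].append(doc)
    batches = []
    for n_edus in sorted(buckets):
        group = buckets[n_edus]
        for i in range(0, len(group), batch_size):
            batches.append(group[i:i + batch_size])
    return batches


docs = [["a"], ["b", "c"], ["d"], ["e"]]
batches = batches_by_edu_count(docs, batch_size=2)
```

Documents with the same number of EDUs require the same number of shift and reduce steps, so every document in a batch can advance one parsing step at a time in lockstep.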

Baselines
To evaluate the performance of our model in the context of RST-style discourse parsing, we compare it against a variety of previously proposed, competitive baselines: The DPLP parser (Ji and Eisenstein, 2014) is a traditional discourse parser utilizing an SVM classifier within the shift-reduce framework solely based on linear projections of lexical features. The CODRA model (Joty et al., 2015) uses an optimal CKY-based chart parser in combination with Dynamic Conditional Random Fields (CRF), separated on sentence level. The gCRF model (Feng and Hirst, 2014) follows a similar approach but utilizes a greedy strategy. The Two-Stage parser proposed by Wang et al. (2017) is the current SOTA system on the RST-DT structure prediction. The model uses two separate linear SVM classifiers. We use the public codebase provided by Wang et al. (2017) and remove the relation classification module for our experiments. Transition-Syntax, the parser by Yu et al. (2018), is a neural shift-reduce parser utilizing LSTMs to generate EDU embeddings. In addition, they apply a neural dependency parser for extracting syntactic features. Cross-Lingual is the neural shift-reduce approach by Braud et al. (2017), utilizing RST treebanks from multiple languages. The Top-Down-Generative parser by Mabona et al. (2019) is a recent top-down transition-based neural generative parser employing Tree-LSTMs for encoding subtrees on the stack.

Metrics
As suggested in a recent literature analysis by Morey et al. (2017), we use the original Parseval measure to compare the micro-average F1-scores of our model with our selected baselines. To further allow additional comparisons, we also report the results with respect to RST-Parseval, currently still more commonly used in recent literature.
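A micro-averaged span F1 for the structure metric can be sketched as below. The sketch is a simplification under stated assumptions: it scores unlabeled spans only, trees are nested (nuclearity, left, right) tuples with string leaves, and the `include_leaves` switch is a simplified stand-in for the difference between the two measures (treating leaf-level spans as the only difference is an assumption, not a full account of RST-Parseval). All names are hypothetical.

```python
def spans(tree, include_leaves):
    """Collect the (start, end) EDU index spans covered by the nodes of a tree."""
    found = set()

    def walk(node, start):
        if not isinstance(node, tuple):  # leaf: a single EDU
            if include_leaves:
                found.add((start, start))
            return start + 1
        _, left, right = node
        mid = walk(left, start)
        end = walk(right, mid)
        found.add((start, end - 1))
        return end

    walk(tree, 0)
    return found


def micro_f1(gold_trees, pred_trees, include_leaves=False):
    """Micro-average: pool span counts over all documents before computing F1."""
    matched = n_gold = n_pred = 0
    for gold, pred in zip(gold_trees, pred_trees):
        gold_spans = spans(gold, include_leaves)
        pred_spans = spans(pred, include_leaves)
        matched += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if matched == 0:
        return 0.0
    precision, recall = matched / n_pred, matched / n_gold
    return 2 * precision * recall / (precision + recall)


gold = ("NS", ("NN", "a", "b"), "c")
pred = ("NS", "a", ("NN", "b", "c"))
strict = micro_f1([gold], [pred])
lenient = micro_f1([gold], [pred], include_leaves=True)
```

In this toy example the score rises from 0.5 to 0.8 once the trivially correct leaf spans are counted, which illustrates why scores under the two measures are not directly comparable.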

Preliminary Evaluation
In our preliminary experiments, we evaluate a set of modelling decisions on the held-out development set, influencing the design of our final model. We obtained three useful insights during this phase: (1) Adding padding to the three classifier input strings used in the RoBERTa model, extending each of the two stack elements $S_1$ and $S_2$ to the maximum defined length of 240 words and padding the top element on the queue ($Q_1$) to 28 words, substantially enhanced the performance of the component. We believe this is likely to be the case because RoBERTa internally uses absolute positional embeddings.
(2) Following the intuition that incorrect shift and reduce actions should be penalized similarly (independent of the nuclearity label), we found that weighting the loss function as described in section 3.3 boosts the model's performance on both the structure and the nuclearity metric. (3) We experimented with different ways of summarizing the outputs of RoBERTa (see Figure 1(c)) into a single vector. After experiments with simple and attention-based averaging of RoBERTa outputs, we found these approaches to produce slightly worse results compared to simply using the output vector corresponding to the [CLS] token.

Results
The final results of our evaluation are presented in Tables 1 and 2. The two tables contain slightly different subsets of competitive discourse parsers from previous work, depending on the metric on which the original authors evaluate their models. The reported scores were either taken from the original paper or the literature survey by Morey et al. (2017). The left-most column in the first sub-table contains the models in the comparison, showing previously proposed baselines along with two versions of our new neural discourse parser, with and without pretraining. The center column contains the evaluation results on the structure prediction task for the two test datasets (RST-DT and Instr-DT). The right-most column shows the performance for each of the models on the nuclearity prediction task, again subdivided for the two evaluation datasets. The second sub-table contains the results for various ablations of our model and the bottom sub-table shows human results on the tasks. For all of our models we report the average performance as well as the standard deviation on each metric over five independent runs. We compare the two versions of our neural discourse parser against the best-performing, previously proposed model for each of the four prediction tasks, meaning we compare the performance of our model, for example on the RST-DT dataset, against the current SOTA model on RST-DT structure prediction by Wang et al. (2017) and for nuclearity against Yu et al. (2018). The best performing baseline on the Instr-DT dataset is the CODRA model (Joty et al., 2015). Please note again the different evaluation procedure on Instr-DT that was used for the CODRA model (Joty et al., 2015). For a more direct comparison, please see our Instr-DT results against the Two-Stage parser (Wang et al., 2017), which utilizes the same split as ours.
When examining our final evaluations shown in Tables 1 and 2 it becomes clear that our newly proposed neural discourse parser reaches the highest performance on all measures except the structure prediction on the Instr-DT dataset. We observe that our model strongly outperforms the SOTA approach on the RST-DT structure prediction by Wang et al. (2017). Furthermore, pretraining on the MEGA-DT treebank leads to further improvement with respect to the mean scores over independent runs.
On the Instr-DT dataset, our parser achieves a result similar to the model of Joty et al. (2015) on the structure prediction task and substantially outperforms the SOTA baseline on the nuclearity measure when pretraining is applied. Nonetheless, a particularly important result is that our system produces consistently strong performance across multiple domains, which neither of the top-performing traditional systems (Wang et al., 2017; Joty et al., 2015) managed to demonstrate. This serves as an indication that employing large-scale language models alleviates the need for extensive manual feature engineering employed by these systems for RST discourse parsing.
In addition, we perform an ablation study of our system to analyze the importance of each individual component of our parser. The first row in the second sub-table illustrates the results when only organizational features are used, while the second row shows the impact of removing the features and only using RoBERTa for the action classification. Finally, the third row contains the performance of our system with organizational features and a randomly initialized RoBERTa model component.
We observe that removing the organizational features results in a noticeable drop in performance, implying the importance of encoding the document structure explicitly. Unsurprisingly, removing the RoBERTa feature extractor leads to a large performance drop, far below the competitive baselines, since this version of our system does not take discourse connective words into account. Finally, we demonstrate the importance of LM pretraining and pretrained word embeddings in the last row of the ablation sub-table. While this system performs on par with traditional systems with respect to structure prediction, most likely because of the organizational features, it demonstrates inferior performance on the nuclearity prediction task, which (even in the easiest scenarios) requires knowledge of more high-level concepts, such as sentence coordination and subordination. In more advanced cases, it plausibly requires knowledge about the author's communicative goal. The overall difficulty of this task is reflected in the relatively low human evaluation scores shown in the last row. Our results can be summarized as follows: (1) Our proposed approach achieves state-of-the-art performance on both the RST-DT and the Instr-DT datasets.
(2) Applying large-scale language models leads to stronger results and higher domain adaptivity in RST discourse parsing. (3) Pretraining the discourse parser on the large-scale "silver-standard" MEGA-DT treebank enhances the performance and supports the ability of the neural parser to generalize across multiple datasets and domains.

Conclusions
In this work, we proposed a rather simple, yet highly effective discourse parser, utilizing recent neural BERT-based language models in combination with structural features. The integration of those input features within a standard shift-reduce framework as well as an unprecedented use of recent large-scale "silver-standard" discourse parsing datasets for pretraining reaches a new state-of-the-art performance on both the RST-DT and Instr-DT treebanks. We show that our new neural discourse parser already achieves better or similar performance when trained and evaluated on the RST-DT and Instr-DT datasets. However, the consistent and significant SOTA result is reached when incorporating pretraining on the MEGA-DT corpus. The presented pretraining approach on the silver-standard MEGA-DT dataset also further validates the usefulness of additional supervision for this task and calls for more work in that area.

Future Work
As directions for future work, we plan to run experiments with larger language models, as our lightweight RoBERTa model only contains 82M parameters, while top-performing language models such as BERT-Large utilize an order of magnitude more parameters. We also want to explore neural parsing strategies besides the shift-reduce framework trained on the large-scale MEGA-DT treebank, comparable to the recently proposed top-down neural discourse parsing architecture by Mabona et al. (2019). Further, we plan to extend our framework to also predict relations, generating complete discourse trees. Another line of future work is to evaluate the effectiveness of pretraining on other, more shallow discourse analysis frameworks and datasets such as PDTB, only containing flat discourse trees, which might potentially require new approaches for silver-standard annotation. Lastly, in the short term we plan to overcome our computational limitations and pretrain our model on the full MEGA-DT. After that, we also want to generally venture into even larger pretrained treebanks generated according to Huber and Carenini (2020), taking into account more diverse sentiment datasets (such as IMDB (Diao et al., 2014) and Amazon reviews (Zhang et al., 2015)) to extend the size and generality of the pretraining approach and eventually enhance the overall performance of our parser.