RST Parsing from Scratch

We introduce a novel top-down end-to-end formulation of document level discourse parsing in the Rhetorical Structure Theory (RST) framework. In this formulation, we consider discourse parsing as a sequence of splitting decisions at token boundaries and use a seq2seq network to model the splitting decisions. Our framework facilitates discourse parsing from scratch without requiring discourse segmentation as a prerequisite; rather, it yields segmentation as part of the parsing process. Our unified parsing model adopts a beam search to decode the best tree structure by searching through a space of high scoring trees. With extensive experiments on the standard RST discourse treebank, we demonstrate that our parser outperforms existing methods by a good margin in both end-to-end parsing and parsing with gold segmentation. More importantly, it does so without using any handcrafted features, making it faster and easily adaptable to new languages and domains.


Introduction
In a document, the clauses, sentences and paragraphs are logically connected together to form a coherent discourse. The goal of discourse parsing is to uncover this underlying coherence structure, which has been shown to benefit numerous NLP applications including text classification (Ji and Smith, 2017), summarization (Gerani et al., 2014), sentiment analysis (Bhatia et al., 2015), machine translation evaluation (Joty et al., 2017) and conversational machine reading (Gao et al., 2020).
Rhetorical Structure Theory or RST (Mann and Thompson, 1988), one of the most influential theories of discourse, postulates a hierarchical discourse structure called a discourse tree (DT). The leaves of a DT are clause-like units, known as elementary discourse units (EDUs). Adjacent EDUs and higher-order spans are connected hierarchically through coherence relations (e.g., Contrast, Explanation). Spans connected through a relation are categorized based on their relative importance: the nucleus is the main part, while the satellite is the subordinate one. Fig. 1 exemplifies a DT spanning over two sentences and six EDUs. Finding discourse structure generally requires breaking the text into EDUs (discourse segmentation) and linking the EDUs into a DT (discourse parsing).
Discourse parsers can be categorized by whether they apply a bottom-up or top-down procedure. Bottom-up parsers include transition-based models (Feng and Hirst, 2014; Ji and Eisenstein, 2014; Braud et al., 2017; Wang et al., 2017) and globally optimized chart parsing models (Soricut and Marcu, 2003; Joty et al., 2013, 2015). The former construct a DT through a sequence of shift and reduce decisions, and can parse a text in asymptotic running time that is linear in the number of EDUs. However, transition-based parsers make greedy local decisions at each decoding step, which can propagate errors into future steps. In contrast, chart parsers learn scoring functions for sub-trees and adopt a CKY-like algorithm to search for the highest-scoring tree. These methods normally have higher accuracy but suffer from slow parsing speed, with a complexity of O(n^3) for n EDUs. Top-down parsers are relatively new in discourse (Lin et al., 2019; Zhang et al., 2020; Kobayashi et al., 2020). These methods build a DT by finding a splitting point in each iteration. However, local decisions can still hurt performance, as most of these methods remain greedy.
Like most other fields in NLP, language parsing has also undergone a major paradigm shift from traditional feature-based statistical parsing to end-to-end neural parsing. Being able to parse a document end-to-end from scratch is appealing for several key reasons. First, it makes the overall development procedure easily adaptable to new languages, domains and tasks by bypassing the expensive feature engineering step, which often requires substantial time and domain/language expertise. Second, the lack of an explicit feature extraction phase makes training and testing (decoding) faster.
Because of the task complexity, it is only recently that neural approaches have started to outperform traditional feature-rich methods. However, successful document-level neural parsers still rely heavily on handcrafted features (Yu et al., 2018; Zhang et al., 2020; Kobayashi et al., 2020). Therefore, even though these methods adopt a neural framework, they are not "end-to-end" and do not enjoy the above-mentioned benefits of an end-to-end neural parser. Moreover, in existing methods (both traditional and neural), discourse segmentation is detached from parsing and treated as a prerequisite step. Therefore, errors in segmentation affect the overall parsing performance (Soricut and Marcu, 2003; Joty et al., 2012).
In view of the limitations of existing approaches, in this work we propose an end-to-end top-down document-level parsing model that:
• Can generate a discourse tree from scratch without requiring discourse segmentation as a prerequisite step; rather, it generates the EDUs as a by-product of parsing. Crucially, this novel formulation facilitates solving the two tasks in a single neural model. Our formulation is generic and works in the same way when it is provided with the EDU segmentation.
• Treats discourse parsing as a sequence of splitting decisions at token boundaries and uses a seq2seq pointer network (Vinyals et al., 2015) to model the splitting decisions at each decoding step. Importantly, our seq2seq parsing model can adopt beam search to widen the search space for the highest scoring tree, which to our knowledge is also novel for the parsing problem.
• Does not rely on any handcrafted features, which makes it faster to train or test, and easily adaptable to other domains and languages.
• Achieves the state of the art (SoTA) with an F1 score of 46.6 in the Full (label + structure) metric for end-to-end parsing on the English RST Discourse Treebank, outperforming many parsers that use gold EDU segmentation. With gold segmentation, our model achieves a SoTA F1 score of 50.2 (Full), outperforming the best existing system by 2.1 absolute points. More importantly, it does so without using any handcrafted features (not even part-of-speech tags).

Model
Assuming that a document has already been segmented into EDUs, following the traditional approach, the corresponding discourse tree (DT) can be represented as a set of labeled constituents:

C = { ((i_t, k_t, j_t), r_t) }_{t=1}^{m}     (1)

where m = |C| is the number of internal nodes in the tree and r_t is the relation label between the discourse unit containing EDUs i_t through k_t and the one containing EDUs k_t + 1 through j_t.
Traditionally, in RST parsing, discourse segmentation is performed first to obtain the sequence of EDUs, which is followed by the parsing process to assemble the EDUs into a labeled tree. In other words, traditionally discourse segmentation and parsing have been considered as two distinct tasks that are solved by two different models.
In contrast, in this work we take a radically different approach that directly starts with parsing the (unsegmented) document in a top-down manner and treats discourse segmentation as a special case of parsing that we get as a by-product. Importantly, this novel formulation of the problem allows us to solve the two problems in a single neural model. Our parsing model is generic and also works in the same way when it is fed with EDU-segmented text. Before presenting the model architecture, we first formulate the problem as a splitting decision problem at the token level.

Parsing as a Splitting Decision Problem
We reformulate the discourse parsing problem from Eq. (1) as a sequence of splitting decisions at token boundaries (instead of EDUs). Specifically, the input text is first prepended and appended with the special start (<sod>) and end (<eod>) tokens, respectively. We define a token boundary as the indexed position between two consecutive tokens. For example, the constituent spanning "But he added :" in Fig. 2 is defined as (0, 4).

With this, we can define parsing as a set of splitting decisions S at token boundaries by the following proposition:

Proposition 1. Given a binarized discourse tree for a document containing n tokens, the tree can be converted into a set of token-boundary splitting decisions S = {(i, j) → k | i < k ≤ j} such that the parent constituent (i, j) either gets split into two child constituents (i, k) and (k, j) for k < j, or forms a terminal EDU for k = j, i.e., the span is not split further (which marks the segmentation).
Notice that S is a generalized formulation of RST parsing, which also includes the decoding of EDUs as a special case (k = j). It is quite straightforward to adapt this formulation to the parsing scenario where the discourse segmentation (sequence of EDUs) is provided. Formally, in that case, the tree can be converted into a set of splitting decisions S_edu = {(i, j) → k | i < k < j} such that the constituent (i, j) gets split into two constituents (i, k) and (k, j) for k < j; i.e., we simply omit the special case of k = j as the EDUs are given. In other words, in our generalized formulation, discourse segmentation is just one extra step of parsing, and can be done top-down, end-to-end.
An example of our formalism of the parsing problem is shown in Fig. 1 for a discourse tree spanning over two sentences (44 tokens); for simplicity, we do not show the relation labels corresponding to the splitting decisions (marked by →). Since each splitting decision corresponds to one and only one internal node in the tree, the transformation from the tree to S (and S_edu) is guaranteed to be a one-to-one mapping. Therefore, predicting the sequence of such splitting decisions is equivalent to predicting the discourse tree (DT).
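To make the tree-to-decisions mapping of Proposition 1 concrete, here is a minimal sketch (not from the paper's code) that converts a binarized tree into the decision set S. The nested-tuple tree encoding and the function name are our own illustrative assumptions: a node is either a leaf span `(i, j)` forming an EDU, or `((i, j), left, right)`.

```python
# Hypothetical sketch: convert a binarized discourse tree into the
# splitting-decision set S = {(i, j, k)}, meaning span (i, j) splits at k.
# A node is either a leaf (i, j) (an EDU, where k = j), or ((i, j), l, r).

def tree_to_splits(node):
    """Return the set of decisions {(i, j, k)} for the subtree `node`."""
    if len(node) == 2:                 # leaf: EDU over boundaries (i, j)
        i, j = node
        return {(i, j, j)}             # k = j marks a terminal EDU
    (i, j), left, right = node
    # the split point k is the right edge of the left child
    k = left[0][1] if len(left) == 3 else left[1]
    return {(i, j, k)} | tree_to_splits(left) | tree_to_splits(right)

# Toy tree: span (0, 6) split at 4; both children are EDUs.
tree = ((0, 6), (0, 4), (4, 6))
print(sorted(tree_to_splits(tree)))    # [(0, 4, 4), (0, 6, 4), (4, 6, 6)]
```

Note how the EDU-marking decisions `(0, 4, 4)` and `(4, 6, 6)` are exactly the ones dropped when forming S_edu from gold segmentation.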
Seq2Seq Parsing Model. In this work, we adopt a structure-then-label framework. Specifically, we factorize the probability of a DT into the probability of the tree structure and the probability of the relations (i.e., the node labels) as follows:

P_θ(DT | x) = P_θ(S | x) · P_θ(L | S, x)     (2)

where x is the input document, and S and L respectively denote the structure and labels of the DT. This formulation allows us to first infer the best tree structure (e.g., using beam search) and then find the corresponding labels. As discussed, we consider the structure prediction problem as a sequence of splitting decisions that generate the tree in a top-down manner. We use a seq2seq pointer network (Vinyals et al., 2015) to model the sequence of splitting decisions (Fig. 3). We adopt a depth-first order for the decision sequence, which showed more consistent performance in our preliminary experiments than alternatives such as breadth-first order.

Figure 3: Our discourse parser along with a few decoding steps for a given document. The input to the decoder at each step is the representation of the span to be split. We predict the splitting point using the biaffine function between the corresponding decoder state and the token-boundary encoder representations. The figure is for end-to-end parsing, where each EDU-corresponding span points to its right edge to mark the EDU. The coherence relations between the left and right spans are assigned using a label classifier after the (approximately) optimal tree structure is formed using beam search.
First, we encode the tokens in a document x = (x_0, ..., x_n) with a document encoder and get the token-boundary representations (h_0, ..., h_n). Then, at each decoding step t, the model takes as input an internal node (i_t, j_t) and produces an output y_t (by pointing to the token boundaries) that represents the splitting decision (i_t, j_t) → k_t, splitting it into two child constituents (i_t, k_t) and (k_t, j_t). For example, the initial span (0, 44) in Fig. 1 is split at boundary position 4, yielding two child spans (0, 4) and (4, 44). If the span (0, 4) is given as an EDU (i.e., the segmentation is given), the splitting stops at (0, 4), so this decision is omitted in S_edu (Fig. 1). Otherwise, an extra decision (0, 4) → 4 ∈ S needs to be made to mark the EDU for end-to-end parsing. With this, the probability of S can be expressed as:

P_θ(S | x) = ∏_{t=1}^{|S|} P_θ(y_t | y_{<t}, x)     (3)

This end-to-end conditional splitting formulation is the main novelty of our method and is in contrast to previous approaches, which rely on offline-inferred EDUs from a separate discourse segmenter. Our formalism streamlines the overall parsing process, unifies the neural components seamlessly and smooths the training process.
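The depth-first generation of splitting decisions can be sketched with a simple stack, assuming a hypothetical oracle `split(i, j)` that returns the predicted k for a span; in the real model this role is played by the pointer network's argmax (or beam search):

```python
# Sketch of depth-first top-down decoding. `split` is a stand-in for the
# pointer network: given a span (i, j), it returns the split point k
# (with k == j meaning the span is a terminal EDU).

def decode_depth_first(n, split):
    stack, decisions = [(0, n)], []
    while stack:
        i, j = stack.pop()
        k = split(i, j)
        decisions.append((i, j, k))
        if k < j:                   # internal split: recurse on children
            stack.append((k, j))    # push right first, so the left child
            stack.append((i, k))    # is processed next (depth-first)
        # k == j: span (i, j) is a terminal EDU, nothing to push
    return decisions

# Oracle for a toy tree: (0, 6) splits at 4; (0, 4) and (4, 6) are EDUs.
oracle = {(0, 6): 4, (0, 4): 4, (4, 6): 6}
print(decode_depth_first(6, lambda i, j: oracle[(i, j)]))
```

The decision sequence comes out in depth-first order, matching the decoding order described above.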

Model Architecture
In the following, we describe the components of our parsing model: the document encoder, the boundary and span representations, the decoding process through the decoder, and the label classifier.

Document Encoder. Given an input document of n words x = (x_1, ..., x_n), we first add the <sod> and <eod> markers to the sequence. After that, each token x_i in the sequence is mapped into its dense vector representation e_i as:

e_i = [e_i^char ; e_i^word]     (4)

where e_i^char and e_i^word are respectively the character and word embeddings of token x_i. For word embeddings, we experiment with (i) random initialization, (ii) pretrained static embeddings, e.g., GloVe (Pennington et al., 2014), and (iii) pretrained contextualized embeddings, e.g., XLNet (Yang et al., 2019). To obtain the character embedding of a token, we apply a character bidirectional LSTM, i.e., Bi-LSTM (Hochreiter and Schmidhuber, 1997). The token representations are then passed to a three-layer Bi-LSTM sequence encoder to obtain their forward f_i and backward b_i contextual representations.
Token-boundary Span Representations. To represent each token-boundary position k between token positions k and k + 1, we use the fencepost representation (Cross and Huang, 2016):

h_k = [f_k ; b_{k+1}]     (5)

where f_k and b_{k+1} are the forward and backward LSTM hidden vectors of positions k and k + 1 respectively, and [·;·] is the concatenation operation. Then, to represent the token-boundary span (i, j), we use a linear combination of the two endpoints i and j:

h_{i,j} = W_1 h_i + W_2 h_j     (6)

where W_1 and W_2 are trainable weights. These span representations are used as input to the decoder or the label classifier. Fig. 4 illustrates an example boundary span representation.
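The two representations can be sketched in a few lines of plain Python, with lists standing in for LSTM state vectors and scalar "weights" in place of the trainable matrices (an illustrative simplification, not the actual implementation):

```python
# Sketch of the fencepost boundary representation (Eq. 5) and the span
# encoding (Eq. 6), with plain lists in place of LSTM hidden vectors.

def boundary_repr(f, b, k):
    """h_k = [f_k ; b_{k+1}]: concatenate forward state k, backward k+1."""
    return f[k] + b[k + 1]            # list concatenation stands in for [.;.]

def span_repr(h_i, h_j, w1, w2):
    """h_{i,j} = W1 h_i + W2 h_j, with scalar weights for illustration."""
    return [w1 * a + w2 * c for a, c in zip(h_i, h_j)]

f = [[1.0], [2.0], [3.0]]             # forward states f_0..f_2
b = [[9.0], [8.0], [7.0]]             # backward states b_0..b_2
h0 = boundary_repr(f, b, 0)           # [1.0, 8.0]
h1 = boundary_repr(f, b, 1)           # [2.0, 7.0]
print(span_repr(h0, h1, 0.5, 0.5))    # with w1 = w2 = 0.5: [1.5, 7.5]
```

In the real model, W_1 and W_2 are matrices and the vectors live in R^d; the structure of the computation is the same.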
The Decoder. Our model uses a unidirectional LSTM as the decoder. At each decoding step t, the decoder takes as input the corresponding span (i, j) (i.e., h_{i,j}) and its previous LSTM state d_{t−1} to generate the current state d_t. Then the biaffine function (Dozat and Manning, 2017) is applied between d_t and all the encoded token-boundary representations (h_0, h_1, ..., h_n) as follows:

d'_t = MLP_d(d_t),   h'_i = MLP_h(h_i)
s_t^i = (d'_t)^T W_dh h'_i + (h'_i)^T w_h,   for i = 0, ..., n     (7)

where each MLP operation comprises a linear transformation with LeakyReLU activation (Maas et al., 2013) to transform d_t and h_i into equal-sized vectors d'_t, h'_i ∈ R^d, and W_dh ∈ R^{d×d} and w_h ∈ R^d are respectively the weight matrix and weight vector of the biaffine function. The resulting biaffine scores s_t^i are then fed into a softmax layer to obtain the pointing distribution a_t ∈ [0, 1]^{n+1} for the splitting decision. During inference, when decoding the tree at step t, we only examine the "valid" splitting points between i and j, i.e., we look for k such that i < k ≤ j.
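A toy sketch of the biaffine pointing score followed by a softmax restricted to the valid split points i < k ≤ j (dimensions and weight values are made up for illustration; the real model uses learned matrices and MLP-transformed vectors):

```python
import math

# Sketch of the biaffine pointing score s_t^i = d^T W h_i + w^T h_i,
# followed by a softmax over the valid split points i < k <= j.

def biaffine(d, W, w, h):
    bilinear = sum(d[a] * W[a][b] * h[b]
                   for a in range(len(d)) for b in range(len(h)))
    return bilinear + sum(w[b] * h[b] for b in range(len(h)))

def split_distribution(d, W, w, H, i, j):
    """Pointing distribution over valid split points of span (i, j)."""
    scores = {k: biaffine(d, W, w, H[k]) for k in range(i + 1, j + 1)}
    z = sum(math.exp(s) for s in scores.values())
    return {k: math.exp(s) / z for k, s in scores.items()}

H = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # boundary states h_0..h_3
d = [1.0, -1.0]                                        # decoder state
W = [[1.0, 0.0], [0.0, 1.0]]                           # W_dh
w = [0.5, 0.5]                                         # w_h
probs = split_distribution(d, W, w, H, 0, 3)
print(max(probs, key=probs.get))                       # most likely split point
```

During training the softmax is taken over all n + 1 boundaries; the restriction to valid points shown here is the inference-time behavior described above.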
Label Classifier. We perform label assignment after decoding the entire tree structure. Each assignment takes into account the splitting decision that produced it, since the label represents the relation between the child spans. Specifically, for a constituent (i, j) that was split into two child constituents (i, k) and (k, j), we determine the coherence relation between them as follows:

h^l_{ik} = MLP_l(h_{i,k}),   h^r_{kj} = MLP_r(h_{k,j})
P_θ(l | (i, k), (k, j)) = softmax_l((h^l_{ik})^T W_lr h^r_{kj} + (h^l_{ik})^T W_l + (h^r_{kj})^T W_r + b)     (8)

where L is the total number of labels (i.e., coherence relations with nuclearity attached); each of MLP_l and MLP_r includes a linear transformation with LeakyReLU activation to transform the left and right spans into equal-sized vectors h^l_{ik}, h^r_{kj} ∈ R^d; W_lr ∈ R^{d×L×d}, W_l ∈ R^{d×L}, W_r ∈ R^{d×L} are the weights, and b is a bias vector.
Training Objective. Our parsing model is trained by minimizing the total loss:

L(θ_e, θ_d, θ_l) = L_s(θ_e, θ_d) + L_l(θ_e, θ_l)     (9)

where the structure loss L_s and label loss L_l are cross-entropy losses computed for the splitting and labeling tasks respectively, and θ_e, θ_d and θ_l denote the encoder, decoder and labeling parameters.
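The total loss is simply the sum of two cross-entropy terms, one over pointing distributions and one over label distributions. A minimal sketch with made-up probabilities (the distributions here are placeholders, not model outputs):

```python
import math

# Sketch of L = L_s + L_l: both terms are mean cross-entropy losses
# over the predicted distributions (the values below are illustrative).

def cross_entropy(dists, gold):
    """Mean negative log-likelihood of the gold indices/labels."""
    return -sum(math.log(p[g]) for p, g in zip(dists, gold)) / len(gold)

split_dists = [{4: 0.7, 17: 0.3}, {17: 0.9, 25: 0.1}]   # pointing distributions
gold_splits = [4, 17]
label_dists = [{"Elaboration": 0.6, "Contrast": 0.4}]
gold_labels = ["Elaboration"]

loss = cross_entropy(split_dists, gold_splits) + cross_entropy(label_dists, gold_labels)
print(round(loss, 4))
```

Because both terms share the encoder parameters θ_e, minimizing the sum trains the splitting and labeling tasks jointly on top of a shared representation.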

Complete Discourse Parsing Models
Having presented the generic framework, we now describe how it can be easily adapted to the two parsing scenarios: (i) end-to-end parsing and (ii) parsing with EDUs. We also describe the incorporation of beam search for inference.
End-to-End Parsing. As mentioned, previous work on end-to-end parsing assumes a separate segmenter that provides EDU-segmented text to the parser. Our method, in contrast, is an end-to-end framework that produces both the EDUs and the parse tree in the same inference process. To guide the search better, we incorporate an inductive bias into our inference based on the finding that most sentences have a well-formed subtree in the document-level tree (Soricut and Marcu, 2003), i.e., discourse structure tends to align with the text structure (sentence boundaries in this case); for example, Fisher and Roark (2007) and Joty et al. (2013) found that more than 95% of the sentences have a well-formed subtree in the RST discourse treebank. Our goal is to ensure that each sentence corresponds to an internal node in the tree. This can be achieved by a simple adjustment in our inference. When decoding at time step t with the span (i_t, j_t) as input, if the span contains M > 0 sentence boundaries within it, we pick the one that has the highest pointing score (Eq. 7) among the M alternatives as the split point k_t. If there is no sentence boundary within the input span (M = 0), we find the next split point as usual. In other words, sentence boundaries in a document get the chance to be split before the token boundaries inside a sentence. This constraint is similar to the 1S-1S (1 subtree for 1 sentence) constraint of Joty et al. (2013)'s bottom-up parsing, and is also consistent with the property that EDUs never cross sentence boundaries. Algorithm 1 illustrates the end-to-end inference algorithm.

Algorithm 1: Discourse Tree Inference (end-to-end). Input: document length n; boundary encoder states (h_0, h_1, ..., h_n); sentence boundary set SB; label scores P(l | (i, k), (k, j)) for 0 ≤ i < k ≤ j ≤ n, l ∈ L; initial decoder state.
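The sentence-boundary constraint reduces to a small change in how the argmax over pointing scores is taken. A sketch (function name and score table are illustrative, not from the paper's code):

```python
# Sketch of the sentence-boundary constraint used during end-to-end
# inference: if the current span contains sentence boundaries, choose the
# highest-scoring one; otherwise fall back to the unconstrained argmax.

def constrained_split(scores, i, j, sent_bounds):
    """scores[k] is the pointing score for boundary k, with i < k <= j."""
    candidates = [k for k in range(i + 1, j + 1) if k in sent_bounds]
    if not candidates:                   # M = 0: no sentence boundary inside
        candidates = list(range(i + 1, j + 1))
    return max(candidates, key=lambda k: scores[k])

scores = {1: 0.1, 2: 0.8, 3: 0.3, 4: 0.5}
print(constrained_split(scores, 0, 4, sent_bounds={3, 4}))   # picks 4, not 2
```

Note that boundary 2 has the highest raw score, but the constraint forces the split onto a sentence boundary first; only spans entirely inside one sentence are split at arbitrary token boundaries.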
Parsing with EDUs. When segmentation information is provided, we can obtain a better encoding of the EDUs to construct the tree. Specifically, rather than simply taking the token-boundary representation corresponding to the EDU boundary as the EDU representation, we adopt a hierarchical approach, where we add another Bi-LSTM layer (called "Boundary LSTM") that connects the EDU boundaries (a figure of this framework is in the Appendix). In other words, the input sequence to this LSTM layer is (h'_0, ..., h'_m), where h'_0 = h_0, h'_m = h_n, and each h'_j ∈ {h_1, ..., h_{n−1}} is a representation of an EDU boundary. For instance, for the example in Fig. 1, the input to the Boundary LSTM layer is (h_0, h_4, h_17, h_25, h_33, h_37, h_44).
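Selecting the Boundary-LSTM input amounts to filtering the encoder states down to the EDU boundaries plus the two document ends. A small sketch (helper name is ours; strings stand in for the encoder state vectors):

```python
# Sketch: build the input sequence for the Boundary LSTM when gold EDUs
# are given, keeping only encoder states at EDU boundaries and the ends.

def boundary_lstm_input(H, edu_bounds, n):
    """H[k] is the token-boundary state h_k; edu_bounds are interior EDU ends."""
    keep = [0] + sorted(b for b in edu_bounds if 0 < b < n) + [n]
    return [H[k] for k in keep]

H = {k: f"h{k}" for k in range(45)}         # placeholder encoder states
print(boundary_lstm_input(H, {4, 17, 25, 33, 37}, 44))
```

With the EDU boundaries of Fig. 1, this reproduces the sequence (h_0, h_4, h_17, h_25, h_33, h_37, h_44) given in the text.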
This hierarchical representation facilitates better modeling of relations between EDUs and higher-order spans, and can capture long-range dependencies better, especially for long documents.

Algorithm 2: Beam-search tree inference when EDUs are given. Each beam item holds a (log-probability, decoder state, input span, partial tree). At each step, the current span (i, j) is fed to the decoder to obtain the split probability distribution a; for each (k, p_k) in the top-B of a with i < k < j, a new candidate is created, and only the top-B highest-scoring partial trees are kept. The best structure S* is then labeled as DT = [(i, k, j, arg max_l P_θ(l | (i, k), (k, j))) for all (i, k, j) ∈ S*].

Incorporating Beam Search. Previous work (Lin et al., 2019; Zhang et al., 2020), which also uses a seq2seq architecture, computes the pointing scores over the token or span representations only within the input span. For example, for an input span (i, j), the pointing scores are computed considering only (h_i, ..., h_j), as opposed to (h_1, ..., h_n) in our Eq. 7. This makes the scales of the scores uneven across different input spans, as the lengths of the spans vary. Thus, such scores cannot be objectively compared across sub-trees at the full-tree level. In addition, since efficient global search methods like beam search cannot be applied properly with non-uniform scores, these previous methods had to remain greedy at each decoding step. In contrast, our decoder points to all the encoded token-boundary representations at every step (Eq. 7). This ensures that the pointing scores are evenly scaled, allowing fair comparisons between the scores of all candidate sub-trees. Therefore, our method enables the effective use of beam search through highly probable candidate trees. Algorithm 2 illustrates the beam search inference when EDUs are given.
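The beam search over splitting decisions can be sketched as follows. This is a schematic simplification, not the paper's Algorithm 2: decoder states are omitted, and `dist` is a stand-in interface returning a `{k: prob}` split distribution for a span (here backed by a fixed table):

```python
import math

# Sketch of beam search over splitting decisions when EDUs are given.
# A beam item is (log-prob, list of pending spans, decisions so far).

def beam_search(n, edu_spans, dist, B=2):
    beam = [(0.0, [(0, n)], [])]
    while any(pending for _, pending, _ in beam):
        nxt = []
        for logp, pending, dec in beam:
            if not pending:                       # finished hypothesis
                nxt.append((logp, pending, dec))
                continue
            (i, j), rest = pending[0], pending[1:]
            for k, p in dist(i, j).items():
                children = [(a, b) for (a, b) in ((i, k), (k, j))
                            if (a, b) not in edu_spans]   # stop at EDUs
                nxt.append((logp + math.log(p), children + rest, dec + [(i, j, k)]))
        beam = sorted(nxt, key=lambda x: -x[0])[:B]       # keep top-B trees
    return beam[0]

table = {(0, 6): {4: 0.6, 2: 0.4}}                # split distributions
edus = {(0, 4), (4, 6), (0, 2), (2, 6)}           # given EDU spans
print(beam_search(6, edus, lambda i, j: table[(i, j)]))
```

Because all hypotheses are scored by summed log-probabilities over the same uniformly scaled pointing distribution, partial trees in the beam are directly comparable, which is exactly the property argued for above.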

Experiments
We conduct our experiments on discourse parsing with and without gold segmentation. We use the standard English RST Discourse Treebank or RST-DT (Carlson et al., 2002) for training and evaluation.
It consists of 385 annotated Wall Street Journal news articles: 347 for training and 38 for testing. We randomly select 10% of the training set as our development set for hyper-parameter tuning. Following prior work, we adopt the same 18 coarser relations defined in Carlson and Marcu (2001). For evaluation, we report the standard Span, Nuclearity, Relation and Full F1 scores, computed using the Parseval (Morey et al., 2017, 2018) and RST-Parseval (Marcu, 2000) metrics.
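At the core of these metrics is a set comparison between predicted and gold constituents; the Span metric compares bare spans, while Nuclearity, Relation and Full additionally attach labels to each span. A minimal sketch of the underlying F1 computation (our own simplified illustration, not the official evaluation script):

```python
# Sketch of the span-level F1 underlying the Parseval-style metrics:
# compare predicted vs. gold constituent spans as sets.

def span_f1(pred, gold):
    match = len(pred & gold)
    p = match / len(pred) if pred else 0.0
    r = match / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

gold = {(0, 6), (0, 4), (4, 6)}
pred = {(0, 6), (0, 2), (2, 6)}
print(round(span_f1(pred, gold), 3))   # only the root span matched
```

For the labeled metrics, the same computation is run over tuples like `(span, nuclearity)` or `(span, relation)` instead of bare spans.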

Parsing with Gold Segmentation
Settings. Discourse parsing with gold EDUs has been the standard practice in many previous studies. We compare our model with ten different baselines as shown in Table 1. We report most results from Morey et al. (2018); Zhang et al. (2020); Kobayashi et al. (2020), while we reproduce Yu et al. (2018) using their provided source code.
For our model setup, we use the encoder-decoder framework with a 3-layer Bi-LSTM encoder and a 3-layer unidirectional LSTM decoder. The LSTM hidden size is 400; the word embedding size is 100 for random initialization, while the character embedding size is 50. The hidden dimension of the MLP modules and biaffine function for structure prediction is 500. The beam width B is 20. Our model is trained with the Adam optimizer (Kingma and Ba, 2015) with a batch size of 10,000 tokens. The learning rate is initialized at 0.002 and scheduled to decay at an exponential rate of 0.75 every 5,000 steps. Model selection for testing is performed based on the Full F1 score on the development set. When using pretrained word embeddings, we use the 100D vectors from GloVe (Pennington et al., 2014). For the pretrained model, we use the XLNet-base-cased version (Yang et al., 2019). The pretrained models/embeddings are kept frozen during training.

Results. From Table 1, we see that our model with GloVe (static) embeddings achieves a Full F1 score of 46.8, the highest among all the parsers that do not use pretrained models (or contextual embeddings). This suggests that a BiLSTM-based parser can be competitive with effective modeling. The model also outperforms the one proposed by Zhang et al. (2020), which is closest to ours in terms of modeling, by 3.9%, 4.1%, 2.4% and 2.5% absolute in Span, Nuclearity, Relation and Full, respectively. More importantly, our system achieves these results without relying on external data or features, in contrast to previous approaches.

Table 1: Parsing results with gold segmentation. The sign + denotes that systems use handcrafted features such as lexical, syntactic, sentence/paragraph boundary features and so on; * denotes that systems use external cross-lingual features; and § means that systems use pretrained models.
In addition, by using the XLNet-base pretrained model, our system surpasses all existing methods (with or without pretraining) on all four metrics, achieving the state of the art with 2.9%, 4.0%, 2.4% and 2.1% absolute improvements. It also reduces the gap between system performance and human agreement. When evaluated with the RST-Parseval (Marcu, 2000) metric, our model outperforms the baselines by 0.6%, 1.4% and 1.8% in Span, Nuclearity and Relation, respectively.

End-to-end Parsing
For end-to-end parsing, we compare our method with the model proposed by Zhang et al. (2020). Their parsing model uses the EDU segmentation from Li et al. (2018). Our method, in contrast, predicts the EDUs along with the discourse tree in a unified process (§2.3). In terms of model setup, we use a setup identical to the experiments with gold segmentation (§3.1). Table 2 reports the performance for document-level end-to-end parsing. Compared to Zhang et al. (2020), our model achieves even better performance and outperforms many models that use gold segmentation (Table 1).

EDU Segmentation Results.
Our end-to-end parsing method achieves an F1 score of 96.30 for the resulting EDUs. This rivals existing SoTA segmentation methods: 92.20 F1 of Li et al. (2018) and 95.55 F1 of Lin et al. (2019). This shows the efficacy of our unified framework not only for discourse parsing but also for segmentation. 2

Ablation Study
To further understand the contributions from the different components of our unified parsing framework, we perform an ablation study by removing selected components from a network trained with the best set of parameters.
With Gold Segmentation. Table 3 shows two ablations for parsing with gold EDUs. We see that both beam search and the boundary LSTM (hierarchical encoding, as shown in Fig. 7) are important to the model. The former finds better tree structures by searching a larger space. The latter connects the EDU-boundary representations, which enhances the model's ability to capture long-range dependencies between EDUs.

2 We could not compare our segmentation results with the DISRPT 2019 Shared Task (Zeldes et al., 2019) participants, as we found a few inconsistencies in the settings. First, in their "gold sentence" dataset, instead of using the gold sentences, they pre-process the text with an automatic tokenizer and sentence segmenter. Second, in the evaluation, under the same settings, they do not exclude the trivial Begin-Segment label at the beginning of each sentence, which we exclude in evaluating our segmentation result (following the RST standard).

End-to-end Parsing. For end-to-end parsing, Table 4 shows that the sentence boundary constraint (§2.3) is indeed quite important for guiding the model as it decodes long texts. Since sentence segmentation models are quite accurate, they can be employed if ground-truth sentence segmentation is not available. We also notice that pretraining (GloVe) leads to improved performance.
Error Analysis. We show our best parser's (with gold EDUs) confusion matrix for the 10 most frequent relation labels in Fig. 5. The complete matrix with the 18 relations is shown in Appendix (Fig. 8).
The imbalanced relation distribution in RST-DT affects our model's performance to some extent. Also, semantically similar relations tend to be confused with each other. Fig. 6 shows an example where our model mistakenly labels Summary as Elaboration.
However, one could argue that the relation Elaboration is also valid here, because the parenthesized text brings additional information (the equivalent amount of money). We show more error examples in the Appendix (Fig. 9-11), where our parser labels relations incorrectly.

Parsing Speed. Table 5 compares the parsing speed of our models with a representative non-neural model (Feng and Hirst, 2014) and a neural model (Yu et al., 2018). We measure speed empirically using the wall time for parsing the test set. We ran the baselines and our models under the same settings (CPU: Intel Xeon W-2133; GPU: Nvidia GTX 1080 Ti). With gold segmentation, our model with GloVe embeddings can parse the test set in 19 seconds, which is up to 11 times faster than Feng and Hirst (2014), even when their features are precomputed. The speed gain can be attributed to (i) the efficient GPU implementation of neural modules to process the decoding steps, and (ii) the fact that our model does not need to compute any handcrafted features. With pretrained models, our parser with gold segmentation is about 2.4 times faster than Yu et al. (2018). Our end-to-end parser, which also performs segmentation, is faster than the baselines that are provided with the EDUs. Nonetheless, we believe there is still room for speed improvement, e.g., by choosing a better network such as the Longformer (Beltagy et al., 2020), which has an O(1) parallel time complexity for encoding a text, compared to the O(n) complexity of the recurrent encoder.

Related Work
Discourse analysis has been a long-established problem in NLP. Prior to the neural tsunami in NLP, discourse parsing methods commonly employed statistical models with handcrafted features (Soricut and Marcu, 2003; Hernault et al., 2010; Feng and Hirst, 2014; Joty et al., 2015). Even within the neural paradigm, most previous studies still rely on external features to achieve their best performance (Wang et al., 2017; Braud et al., 2016, 2017; Yu et al., 2018). These parsers adopt a bottom-up approach, either transition-based or chart-based.
Recently, top-down parsing has attracted more attention due to its ability to maintain an overall view of the input text. Inspired by the Stack-Pointer network (Ma et al., 2018) for dependency parsing, Lin et al. (2019) first propose a seq2seq model for sentence-level parsing. Zhang et al. (2020) extend this to the document level. Kobayashi et al. (2020) adopt a greedy splitting mechanism for discourse parsing inspired by Stern et al. (2017)'s work in constituency parsing. By using pretrained models/embeddings and extra features (e.g., syntactic, text organizational features), these models achieve competitive results. However, their decoder infers a tree greedily.
Our approach differs from previous work in that it can perform end-to-end discourse parsing in a single neural framework without needing segmentation as a prerequisite. Our model can parse a document from scratch without relying on any external features. Moreover, it can apply efficient beam search decoding to search for the best tree.

Conclusion
We have presented a novel top-down end-to-end method for discourse parsing based on a seq2seq model. Our model casts discourse parsing as a series of splitting decisions at token boundaries, which allows it to solve discourse parsing and segmentation in a single model. In both end-to-end parsing and parsing with gold segmentation, our parser achieves the state of the art, surpassing existing methods by a good margin, without relying on handcrafted features. Our parser is not only more effective but also more efficient than existing ones.
This work leads us to several future directions. Our short-term goal is to improve the model with better architectures and training mechanisms. For example, joint training on discourse and syntactic parsing could be a promising direction, since the two tasks are related and can be modeled within our unified conditional splitting framework. We also plan to extend our parser to other languages.

Figure 11: Our system incorrectly labels an Explanation as Elaboration.