Joint Syntacto-Discourse Parsing and the Syntacto-Discourse Treebank

Discourse parsing has long been treated as a stand-alone problem independent from constituency or dependency parsing. Most attempts at this problem rely on annotated text segmentations (Elementary Discourse Units, EDUs) and sophisticated sparse or continuous features to extract syntactic information. In this paper we propose the first end-to-end discourse parser that jointly parses in both syntax and discourse levels, as well as the first syntacto-discourse treebank by integrating the Penn Treebank and the RST Treebank. Built upon our recent span-based constituency parser, this joint syntacto-discourse parser requires no preprocessing efforts such as segmentation or feature extraction, making discourse parsing more convenient. Empirically, our parser achieves the state-of-the-art end-to-end discourse parsing accuracy.

We argue for the first time that discourse parsing should be viewed as an extension of, and be performed in conjunction with, constituency parsing.We propose the first joint syntacto-discourse treebank, by unifying constituency and discourse tree representations.Based on this, we propose the first end-to-end incremental parser that jointly parses at both constituency and discourse levels.Our algorithm builds up on the span-based parser (Cross and Huang, 2016); it employs the strong generalization power of bi-directional LSTMs, and parses efficiently and robustly with an extremely simple span-based feature set that does not use any tree structure information.
We make the following contributions: 1. We develop a combined representation of constituency and discourse trees to facilitate parsing at both levels without explicit conversion mechanism.Using this representation, we build and release a joint treebank based on the Penn Treebank (Marcus et al., 1993) and RST Treebank (Marcu, 2000a,b) (Section 2).
2. We propose a novel joint parser that parses at both constituency and discourse levels.Our parser performs discourse parsing in an endto-end manner, which greatly reduces the efforts required in preprocessing the text for segmentation and feature extraction, and, to our best knowledge, is the first end-to-end discourse parser in literature (Section 3).
3.Even though it simultaneously performs constituency parsing, our parser does not use any explicit syntactic feature, nor does it need any binarization of discourse trees, thanks to the powerful span-based framework of Cross and Huang (2016) (Section 3).
4. Empirically, our end-to-end parser outperforms the existing pipelined discourse parsing efforts.When the gold EDUs are provided, our parser is also competitive to other existing approaches with sophisticated features (Section 4).

Combined Representation & Treebank
We first briefly review the discourse structures in Rhetorical Structure Theory (Mann and Thompson, 1988), and then discuss how to unify discourse and constituency trees, which gives rise to our syntacto-discourse treebank PTB-RST.

Review: RST Discourse Structures
In an RST discourse tree, there are two types of branchings.Most of the internal tree nodes are binary branching, with one nucleus child containing the core semantic meaning of the current node, and one satellite child semantically decorating the nucleus.Like dependency labels, there is a relation annotated between each satellite-nucleus pair, such as "Background" or "Purpose".Figure 1(a) shows an example RST tree.There are also nonbinary-branching internal nodes whose children are conjunctions, e.g., a "List" of semantically similar EDUs (which are all nucleus nodes); see Figure 2(a) for an example.

Syntacto-Discourse Representation
It is widely recognized that lower-level lexical and syntactic information can greatly help determining both the boundaries of the EDUs (i.e., discourse segmentation) (Bach et al., 2012) as well as the semantic relations between EDUs (Soricut and Marcu, 2003;Hernault et al., 2010;Joty and Moschitti, 2014;Feng and Hirst, 2014;Ji and Eisenstein, 2014;Li et al., 2014a;Heilman and Sagae, 2015).While these previous approaches rely on pre-trained tools to provide both EDU segmentation and intra-EDU syntactic parse trees, we instead propose to directly determine the low-level segmentations, the syntactic parses, and the highlevel discourse parses using a single joint parser.This parser is trained on the combined trees of constituency and discourse structures.We first convert an RST tree to a format similar  to those constituency trees in the Penn Treebank (Marcus et al., 1993).For each binary branching node with a nucleus child and a satellite child, we use the relation as the label of the converted parent node.The nucleus/satellite relation, along with the direction (either ← or →, pointing from satellite to nucleus) is then used as the label.For example, at the top level in Figure 2, we convert For a conjunctive branch (e.g."List"), we simply use the relation as the label of the converted node.
After converting an RST tree into the constituency tree format, we then replace each leaf node (i.e., EDU) with the corresponding syntactic (sub)tree from PTB.Given that the sentences in the RST Treebank (Marcu, 2000b) is a subset of that of PTB, we can always find the corresponding constituency subtrees for each EDU leaf node.In most cases, each EDU corresponds to one single (sub)tree in PTB, since the discourse boundaries generally do not conflict with constituencies.In other cases, one EDU node may correspond to multiple subtrees in PTB, and for these EDUs we use the lowest common ancestor of those subtrees in the PTB as the label of that EDU in the converted tree.E.g., if C-D is one EDU in the PTB tree

Joint PTB-RST Treebank
Using the conversion strategy described above we build the first joint syntacto-discourse treebank  We follow the standard training/testing split of the RST Treebank.In the training set, there are 347 joint trees with a total of 17,837 tokens, and the lengths of the discourses range from 30 to 2,199 tokens.In the test set, there are 38 joint trees with a total of 4,819 tokens, and the lengths vary from 45 to 2,607.Figure 3 shows the distribution of the discourse lengths over the whole dataset, which on average is about 2x of PTB sentence length, but longest ones are about 10x the longest lengths in the Treebank.

Joint Syntacto-Discourse Parsing
Given the combined syntacto-discourse treebank, we now propose a joint parser that can perform end-to-end discourse segmentation and parsing.

Extending Span-based Parsing
As mentioned above, the input sequences are substantially longer than PTB parsing, so we choose linear-time parsing, by adapting a popular greedy constituency parser, the span-based constituency parser of Cross and Huang (2016).
As in span-based parsing, at each step, we maintain a a stack of spans.Notice that in conventional incremental parsing, the stack stores the subtrees  k Some text and the symbol or scaled j .In other words, quite shockingly, no tree structure is represented anywhere in the parser.Please refer Cross and Huang (2016) for details.Similar to span-based constituency parsing, we alternate between structural (either shift or combine) and label (label X or nolabel) actions in an odd-even fashion.But different from Cross and Huang (2016), after a structural action, we choose to keep the last branching point k, i.e., i Some text and the symbol or scaled k j (mostly for combine, but also trivially for shift).This is because in our parsing mechanism, the discourse relation between two EDUs is actually determined after the previous combine action.We need to keep the splitting point to clearly find the spans of the two EDUs to determine their relations.This midpoint k disappears after a label action; therefore we can use the shape of the last span on the stack (whether it contains the split point, i.e., i xt and the symbol or scaled k j or i Some text and the symbol or scaled j ) to determine the parity of the step and thus no longer need to carry the step z in the state as in Cross and Huang (2016).
The nolabel action makes the binarization of the discourse/constituency tree unnecessary, because nolabel actually combines the top two spans on the stack σ into one span, but without annotating the new span a label.This greatly simplifies the preprocessing and post-processing efforts needed.

Recurrent Neural Models and Training
The scoring functions in the deductive system (Figure 4) are calculated by an underlying neural model, which is similar to the bi-directional LSTM model in Cross and Huang (2016) that evaluates based on span boundary features.Again, it is important to note that no discourse or syntactic tree structures are represented in the features.
During the decoding time, a document is firstl passed into a two-layer bi-directional LSTM model, then the outputs at each text position of the two layers of the bi-directional LSTMs are concatenated as the positional features.The spans at each parsing step can be represented as the feature vectors at the boundaries.The span features are then passed into fully connected networks with softmax to calculate the likelihood of performing the corresponding action or marking the corresponding label.
We use the "training with exploration" strategy (Goldberg and Nivre, 2013) and the dynamic oracle mechanism described in Cross and Huang (2016) to make sure the model can handle unseen parsing configurations properly.

Empirical Results
We use the treebank described in Section 2 for empirical evaluation.We randomly choose 30 documents from the training set as the development set.
We tune the hyperparameters of the neural model on the development set.For most of the hyperparameters we settle with the same values suggested by Cross and Huang (2016).To alleviate the overfitting problem for training on the relative small RST Treebank, we use a dropout of 0.5.
One particular hyperparameter is that we use a value β to balance the chances between training following the exploration (i.e., the best action chosen by the neural model) and following the correct path provided by the dynamic oracle.We find that β = 0.8, i.e., following the dynamic oracle with a probability of 0.8, achieves the best performance.One explanation for this high chance to follow the oracle is that, since our combined trees are signif- icantly larger than the constituency trees in Penn Treebank, lower β makes the parsing easily divert into wrong trails that are difficult to learn from.Since our parser essentially performs both constituency parsing task and discourse parsing task.We also evaluate the performances on sentence constituency level and discourse level separately.The result is shown in Table 1.Note that in constituency level, the accuracy is not directly comparable with the accuracy reported in Cross and Huang (2016), since: a) our parser is trained on a much smaller dataset (RST Treebank is about 1/6 of Penn Treebank); b) the parser is trained to optimize the discourse-level accuracy.
Table 2 shows that, in the perspective of endto-end discourse parsing, our parser first outperforms the state-of-the-art segmentator of Bach et al. (2012), and furthermore, in end-to-end parsing, the superiority of our parser is more pronounced comparing to the previously best parser of Hernault et al. (2010).
On the other hand, the majority of the conventional discourse parsers are not end-to-end: they rely on gold EDU segmentations and pre-trained tools like Stanford parsers to generate features.We perform an experiment to compare the per-formance of our parser with them given the gold EDU segments (Table 3).Note that most of these parsers do not handle multi-branching discourse nodes and are trained and evaluated on binarized discourse trees (Feng and Hirst, 2014;Li et al., 2014a,b;Ji and Eisenstein, 2014;Heilman and Sagae, 2015), so their performances are actually not directly comparable to the results we reported.

Conclusion
We have presented a neural-based incremental parser that can jointly parse at both constituency and discourse levels.To our best knowledge, this is the first end-to-end parser for discourse parsing task.Our parser achieves the state-of-the-art performance in end-to-end parsing, and unlike previous approaches, needs little pre-processing effort.

Figure 2 :
Figure 2: Another example of RST vs. PTB-RST, demonstrating a discourse tree over two sentences and a non-binary relation (List).The lower levels of the PTB-RST tree are collapsed due to space contraints.
two examples of discourse trees and their combined syntacto-discourse trees.
constructed so far, but in span-based constituency parsing, the stack only stores the boundaries of subtrees, which are just a list of indices ... i Some text and the symbol or scaled

Table 1 :
Accuracies on PTB-RST at constituency and discourse levels.

Table 3 :
Experiments using gold segmentations.The column of "syntactic feats" shows how the syntactic features are calculated in the corresponding systems.Note that our parser predicts solely based on the span features from bi-directionaly LSTM, instead of any explicitly designed syntactic features.