Improving Neural RST Parsing Model with Silver Agreement Subtrees

Most previous Rhetorical Structure Theory (RST) parsing methods are based on supervised learning, such as neural networks, which requires an annotated corpus of sufficient size and quality. However, the RST Discourse Treebank (RST-DT), the benchmark corpus for RST parsing in English, is small because annotating RST trees is costly. The lack of large annotated training data causes poor performance, especially in relation labeling. We therefore propose a method for improving neural RST parsing models by exploiting silver data, i.e., automatically annotated data. We create large-scale silver data from an unlabeled corpus by using a state-of-the-art RST parser. To obtain high-quality silver data, we extract agreement subtrees from the RST trees built for each document by the parsers. We then pre-train a neural RST parser with the obtained silver data and fine-tune it on the RST-DT. Experimental results show that our method achieved the best micro-F1 scores for Nuclearity and Relation, 75.0 and 63.2, respectively. Furthermore, we obtained a remarkable gain of 3.0 points in the Relation score over the previous state-of-the-art parser.


Introduction
Rhetorical Structure Theory (RST) (Mann and Thompson, 1987) is one of the most widely used theories for representing the discourse structure of a text as a tree. RST trees are a kind of constituent tree whose leaves are Elementary Discourse Units (EDUs), i.e., clause-like units, and whose non-terminal nodes cover text spans consisting of either a sequence of EDUs or a single EDU. The label of a non-terminal node represents the nuclearity status of a text span, i.e., nucleus (N) or satellite (S). A discourse relation is also assigned between two adjacent non-terminal nodes.
In most cases, RST parsers have been developed on the basis of supervised learning algorithms (Wang et al., 2017b; Yu et al., 2018; Kobayashi et al., 2020; Lin et al., 2019; Zhang et al., 2020), which require a high-quality annotated corpus of sufficient size. Generally, they train the following three components of RST parsing: (1) structure prediction, by splitting a text span consisting of contiguous EDUs into two smaller ones or merging two adjacent spans into a larger one; (2) nuclearity status prediction for two adjacent spans, by solving a 3-class classification problem; and (3) relation label prediction for two adjacent spans, by solving an 18-class classification problem (see Section 3.3 for details). However, it is costly to annotate RST trees for a huge collection of documents, and thus it is difficult to obtain a large amount of human-annotated data for RST parsing. As a result, research on RST parsing has focused on English, with the largest annotated corpus being the RST Discourse Treebank (RST-DT) (Carlson et al., 2001), although even this is small, with only 385 documents. Many RST parsing methods have recently been developed based on neural models (Ji and Eisenstein, 2014; Li et al., 2014a, 2016; Liu and Lapata, 2017; Braud et al., 2016, 2017). Among them, Kobayashi et al. (2020) is the current state-of-the-art system and is based on a neural top-down method. While its Span and Nuclearity scores reached the highest level, its Relation score still has room for improvement. One reason for the poor Relation score might be the small amount of training data for solving the 18-class classification problem.
Currently, various studies improve neural models for NLP tasks by acquiring large-scale synthetic training data, sometimes called silver data. Among them, a study on Neural Machine Translation (NMT) (Sennrich et al., 2016) introduced a simple learning framework: first pre-train an NMT model with silver data, i.e., pseudo-parallel data generated by automatic back-translation, and then fine-tune it with gold data, i.e., real parallel data, to overcome the data sparseness problem. Since this framework successfully improved NMT systems, it has become a standard approach.
Inspired by the above research, we propose a method for improving a student neural parser by exploiting large-scale silver data, i.e., RST trees generated by an automatic RST parser. Specifically, we improve the state-of-the-art neural RST parser (Kobayashi et al., 2020) in terms of Relation by employing another RST parser, whose Relation score is also state-of-the-art (Wang et al., 2017b), as a teacher parser to generate the silver data. To yield high-quality silver data, we extract a collection of agreement subtrees (ASTs), which are common subtrees among multiple RST trees automatically produced by the teacher parser with different seeds. Our method includes an efficient algorithm for extracting the agreement subtrees so that it can handle large-scale data. We first pre-train the student parser with the obtained silver data. We then fine-tune the parameters of the parser on gold data, the RST-DT. Experimental results on the RST-DT clearly indicate the effectiveness of our silver data. Our method obtained remarkable Nuclearity and Relation F1 scores of 75.0 and 63.2, respectively.

Related Work
Early studies on RST parsing were based on traditional supervised learning methods with handcrafted features and shift-reduce or CKY-like parsing algorithms (duVerle and Prendinger, 2009; Feng and Hirst, 2012, 2014; Joty et al., 2013, 2015). Recently, Wang et al. (2017b) proposed a shift-reduce parser based on SVMs and achieved the current best results among classical statistical models on the RST-DT. Their method first builds nuclearity-labeled RST trees and then assigns relation labels between two adjacent spans consisting of a single EDU or multiple EDUs.
Inspired by the success of neural networks in many NLP tasks, several neural network-based models have been proposed for RST parsing (Ji and Eisenstein, 2014; Li et al., 2014a, 2016; Liu and Lapata, 2017). Yu et al. (2018) proposed a shift-reduce parser based on neural networks and leveraged the information from their neural dependency parsing model within a sentence for RST parsing. The best Relation score on the RST-DT, i.e., an F1 of 60.2, was achieved with their method. Note that Nguyen et al. (2020) proposed a similar approach to ours in NMT, named data diversification, which diversifies the training data by using multiple forward and backward translation models; weak supervision approaches also exist for other discourse representation formalisms (Badene et al., 2019).
Recently, a top-down neural parser was proposed for use only at the sentence level (Lin et al., 2019). The method parses a tree in a depth-first manner with a pointer-generator network. Zhang et al. (2020) extended the method and applied it to document-level RST parsing. Kobayashi et al. (2020) proposed another top-down RST parsing method that exploits multiple granularity levels in a document and achieved the best Span and Nuclearity scores on the RST-DT, i.e., F1 scores of 87.0 and 74.6, respectively.
Since the RST-DT, the largest treebank, contains only 385 documents, several studies have addressed the problem of the limited amount of training data. Braud et al. (2016) leveraged multi-task learning, not only with 13 related tasks as auxiliary tasks but also with multiple views of discourse structures, such as Constituent, Nuclearity, and Relation. Braud et al. (2017) used multilingual RST discourse datasets that share the same underlying linguistic theory. Huber and Carenini (2019) adopted distant supervision with an auxiliary task of sentiment classification to create large-scale training data; that is, they trained a two-stage RST parser (Wang et al., 2017a) with RST trees automatically built from the attention and sentiment scores of a Multiple-Instance Learning network trained on a review dataset. However, these studies need annotated corpora other than the RST-DT, which means we still face the problem of depending on costly annotated corpora. Jiang et al. (2016) proposed a framework for enriching training data based on co-training to improve the performance for infrequent relation labels. However, the method failed to improve the overall Relation score, though they did not aim at improving the Span and Nuclearity scores.
Unsupervised RST parsing methods have also been proposed recently (Kobayashi et al., 2019; Nishida and Nakayama, 2020). Since they are unsupervised, they do not require any annotated corpora. However, they can predict only tree structures and cannot predict nuclearity and relation labels. Therefore, the predicted trees cannot be used for learning to predict relation labels.

Figure 1: Overview of the proposed method. In the subtree extraction step, the teacher RST parsers first annotate trees for unlabeled documents, and then the proposed subtree extraction method constructs large-scale silver data. In the training step, the student parser is trained through pre-training and fine-tuning.
We should mention the relationship of our work to semi-supervised learning as a machine learning framework. First, the reason we do not adopt self-training, where the student and teacher parsers are the same, but instead use two parsers that rely on different parsing algorithms, is that we can acquire as training data instances that the teacher parser can parse but the student parser cannot yet parse correctly. Second, using multiple different RST parsers in a semi-supervised manner might seem reminiscent of co- or tri-training. While co- or tri-training is attractive, it is time-consuming to repeat the step of alternately training multiple neural network-based parsers many times. Thus, previous studies have focused on simplifying the repetition step in constituency and dependency parsing (McClosky et al., 2006; Yu et al., 2015; Pekar et al., 2014; Weiss et al., 2015; Li et al., 2014b).
We believe our method is similar to these simplified versions as a semi-supervised framework with two different RST parsers.
3 Neural RST Parsing with Silver Data

Training Student Parser
Traditional semi-supervised learning frameworks, such as self-, co-, and tri-training, iteratively train a student classifier on training data that contains human-annotated (gold) data plus iteratively added silver data. Since neural network-based models require a large amount of time for training, this iterative procedure is not suitable for them. Furthermore, such training may suffer from a bias problem in the relation-label distribution, because labels frequent in the original training data become even more frequent in subsequent training data. For these reasons, we adopt a simple pre-training and fine-tuning strategy, inspired by NMT research (Sennrich et al., 2016), to train a student RST parser.
Early statistical RST parsing methods relied on handcrafted features, i.e., sentence-level features obtained from parse trees and document-level features, and thus required complete documents with complete sentences for feature extraction. Recent neural models, on the other hand, do not necessarily need such features. Thus, we can exploit subtrees as training data for the neural networks.
Our method involves the following two steps. First, we extract a collection of ASTs from the RST trees of each document in unlabeled data as the silver data. In this step, each document is parsed by multiple teacher RST parsers with different seeds, trained on a gold dataset, the RST-DT. We then apply our algorithm to extract the ASTs, which are common subtrees among the multiple automatically parsed RST trees. In the second step, we pre-train the student RST parser with the collection of ASTs to complement the amount of training data. The parameters of the student parser are then fine-tuned on the RST-DT. Figure 1 shows an overview of the proposed method.

Extracting Agreement Subtrees
A good strategy for obtaining high-quality silver data is to require agreement among the results of multiple RST parsers. However, it is difficult to reach agreement on entire RST trees at the document level because such trees are large. Thus, we cannot collect enough silver data by requiring agreement on whole trees. On the other hand, we find that many subtrees agree among multiple RST trees even when the whole trees do not. Accordingly, we extract ASTs as the silver data.
To create large-scale silver data, we need an efficient algorithm that extracts the ASTs, i.e., common subtrees among multiple RST trees for a document. Note that we need to extract multiple maximal common subtrees among the RST trees. This requires a different algorithm from the maximum agreement subtree problem, which is well known in bioinformatics (Deepak and Fernández-Baca, 2014). Thus, we develop Algorithm 1. The algorithm follows a tree-traversal scheme and runs in O(n) time, where n is the number of nodes in an RST tree.

Algorithm 1: Extracting Agreement Subtrees
Input: trees; Output: subtrees
1: AGREEMENT(root(tree))
2: subtrees ← FINDROOT(root(tree))
In the algorithm, a tree is represented as a fully-labeled nested span structure (see the example in Figure 2). The function AGREEMENT receives an arbitrary span as input and returns a Boolean value indicating whether the subtree for the span is an AST. AGREEMENT first counts how many times the input span appears in the set of given RST trees and checks the status of the left and right children of the input span. Len() returns the length of the span, and Count() returns the frequency of the fully-labeled span among the trees, which indicates how many trees agree on the subtree. The minimum and maximum values of Count() are 1 and k, respectively, where k is the number of RST trees. The variables S_c, S_l, and S_r store the Boolean values for the input span and its left and right children, respectively. Here, root, leftChild, and rightChild are functions returning the root span and the left and right child spans, respectively. To obtain the status of each child, AGREEMENT calls itself on the child span. When the frequency of the input span is k, indicating that all trees in the set agree on the span, and the status of both children is True, indicating that both children are ASTs, the function returns True. The information about which subtrees are ASTs is stored in variable S during the execution of AGREEMENT.

The function FINDROOT returns the list of ASTs based on the information in variable S, as given by AGREEMENT. FINDROOT first checks S(span), the Boolean value of the span. If it is True, the function appends the span, which corresponds to the root node of an AST, to the output. Otherwise, it searches both the left and right children for ASTs recursively. The function therefore lists all of the maximal ASTs in a depth-first fashion.

In the algorithm, l_min and l_max control the size of the extracted ASTs. If the trees parsed by the multiple teacher parsers differ significantly from each other, the extracted ASTs tend to be small, which may introduce noise. To avoid such noise, we do not consider subtrees with fewer than l_min EDUs. Excessively large subtrees are also difficult to handle because they require a lot of time and space for training. Therefore, if the size of a subtree exceeds l_max, the algorithm tries to find smaller ASTs in both of its children.
Initially, we call the function AGREEMENT with an arbitrary tree from the multiple RST trees. Figure 2 shows an example of extracting ASTs. Assume the two trees at the left are from two RST parsers. The right part represents how the algorithm works with the top-left tree as input. In the figure, the two subtrees consisting of spans (1,4) and (5,7) are extracted as ASTs, since the frequency of these two spans and all their descendant spans is 2, the number of given RST trees. Note that while several spans, such as (2,3) and (6,7), are also common subtrees, we do not extract them since they are contained in span (1,4) or (5,7).
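For concreteness, the extraction can be sketched in Python. This is our own simplification, not the authors' implementation: each tree is a binary Node whose span carries a single label string (the paper's spans carry nuclearity and relation labels, which we fold into that string), and span counting replaces the paper's Count() over fully-labeled spans.

```python
from collections import Counter

class Node:
    """Binary RST (sub)tree node covering EDUs i..j with a label string."""
    def __init__(self, i, j, label, left=None, right=None):
        self.i, self.j, self.label = i, j, label
        self.left, self.right = left, right

    def span(self):
        return (self.i, self.j, self.label)

def collect_spans(root):
    """All fully-labeled spans occurring in one tree."""
    stack, spans = [root], []
    while stack:
        node = stack.pop()
        spans.append(node.span())
        if node.left is not None:
            stack.extend([node.left, node.right])
    return spans

def extract_asts(trees, l_min=2, l_max=240):
    """Sketch of Algorithm 1: return the maximal agreement subtrees
    among `trees`, keeping only those with l_min..l_max EDUs."""
    k = len(trees)
    counts = Counter(s for t in trees for s in collect_spans(t))
    status = {}  # node -> is the subtree rooted at this node an AST?

    def agreement(node):
        if node.left is None:          # leaf: a single EDU
            agree = counts[node.span()] == k
        else:                          # recurse first, so both children get a status
            s_l, s_r = agreement(node.left), agreement(node.right)
            agree = counts[node.span()] == k and s_l and s_r
        status[node] = agree
        return agree

    asts = []
    def find_root(node):
        size = node.j - node.i + 1
        if status[node] and l_min <= size <= l_max:
            asts.append(node)          # maximal AST found; do not descend further
        elif node.left is not None:    # too large, too small, or no agreement
            find_root(node.left)
            find_root(node.right)

    agreement(trees[0])                # mark ASTs on an arbitrary tree
    find_root(trees[0])                # list the maximal ones depth-first
    return asts
```

Each node is visited a constant number of times, matching the O(n) traversal described above.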

Span-based Neural Top-down Parser as Student Parser
As described in Section 3.1, the advantage of recent neural models is that they can utilize the annotation for partial documents, or subtrees, as training data. Among the neural models, the span-based neural top-down RST parsing method (Kobayashi et al., 2020) achieved the best Span and Nuclearity scores. Thus, we employ it as the student parser. The method builds a tree by recursively splitting a text span into two smaller ones while predicting the nuclearity status and relation labels. As we explain below, the parser can be trained with arbitrary subtrees for spans consisting of EDUs.

Structure Prediction
For each position k in a span consisting of the i-th to j-th EDUs, a scoring function s_split(i, j, k) is defined as follows:

s_split(i, j, k) = h_{i:k}^T W_u h_{k+1:j} + v_l^T h_{i:k} + v_r^T h_{k+1:j},  (1)

where W_u is a weight matrix and v_l and v_r are weight vectors corresponding to the left and right spans, respectively. h_{i:k} and h_{k+1:j} are defined as h_{i:k} = MLP_left(u_{i:k}) and h_{k+1:j} = MLP_right(u_{k+1:j}), where MLP_* is a multi-layer perceptron. The vector representation of a span, u_{i:j}, is obtained by feeding word embedding vectors into LSTMs. The span is then split at the position that maximizes Eq. (1):

k_hat = argmax_{i <= k < j} s_split(i, j, k).  (2)
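The split scorer can be sketched with NumPy. The biaffine-plus-linear form is read off the definitions above (W_u a matrix, v_l and v_r vectors); the container for the half-span encodings is our own choice, so treat this as an illustrative sketch rather than the authors' exact implementation.

```python
import numpy as np

def s_split(h_left, h_right, W_u, v_l, v_r):
    """Split score for one candidate position k (Eq. 1):
    a biaffine term plus two linear terms over the half-span encodings."""
    return h_left @ W_u @ h_right + v_l @ h_left + v_r @ h_right

def best_split(H_left, H_right, W_u, v_l, v_r, i, j):
    """Choose the split position k in [i, j) that maximizes Eq. 1 (Eq. 2).
    H_left[k] encodes EDUs i..k; H_right[k] encodes EDUs k+1..j."""
    scores = [s_split(H_left[k], H_right[k], W_u, v_l, v_r)
              for k in range(i, j)]
    return i + int(np.argmax(scores))
```

Applied recursively from the full document span, this top-down argmax yields the unlabeled tree structure.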

Label Prediction
When splitting a span at position k, the score of a nuclearity status or relation label ℓ for the two spans is defined as follows:

s_label(i, j, k, ℓ) = W_ℓ MLP([u_{i:k}; u_{k+1:j}; u_{1:i}; u_{j:n}]),  (3)

where W_ℓ is the row of a weight matrix W corresponding to label ℓ, and u_{1:i} and u_{j:n} are vector representations of the spans to the left and right of the current focus. The label that maximizes Eq. (3) is then assigned to the spans:

ℓ_hat = argmax_{ℓ ∈ L} s_label(i, j, k, ℓ),  (4)

where L denotes the set of valid nuclearity status combinations, {N-S, S-N, N-N}, when predicting nuclearity, and the set of relation labels, {Elaboration, Condition, ...}, when predicting the relation. Accordingly, we solve a 3-class classification problem for nuclearity labeling and an 18-class classification problem for relation labeling. Note that the weight parameters W and MLP for nuclearity and relation labeling are learned separately.
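A similarly hedged sketch of the label scorer: here `mlp` stands for any feature transform and `W` stacks one weight row per candidate label, so the score of a label is the corresponding component of the output vector. The names and shapes are our assumptions for illustration.

```python
import numpy as np

def s_label(u_left, u_right, u_out_left, u_out_right, W, mlp):
    """Scores for all candidate labels at one split (Eq. 3):
    concatenate the two inner spans and the two outer spans,
    transform them, and project with W (one row per label)."""
    feats = np.concatenate([u_left, u_right, u_out_left, u_out_right])
    return W @ mlp(feats)

def best_label(labels, scores):
    """Pick the label maximizing the score vector (Eq. 4)."""
    return labels[int(np.argmax(scores))]
```

Running this once with the nuclearity weight matrix (3 rows) and once with the relation weight matrix (18 rows) mirrors the separately learned classifiers described above.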

Parameter Optimization
All parameters, W_u, W, v_l, v_r, and the parameters of the MLPs and LSTMs, are optimized by margin-based learning. Given the correct splitting position k* and correct label ℓ*, the loss functions for splitting and labeling are defined as follows:

loss_split = max(0, 1 + s_split(i, j, k_hat) − s_split(i, j, k*)),
loss_label = max(0, 1 + s_label(i, j, k, ℓ_hat) − s_label(i, j, k, ℓ*)),

where k_hat and ℓ_hat are the predicted splitting position and label, respectively. The parameters are optimized by minimizing the sum of the losses at each splitting point.
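A minimal sketch of the margin objective, assuming the standard hinge form in which the gold candidate's score must exceed the best competing score by a margin of 1; the same function serves for splitting (candidates are positions k) and for labeling (candidates are labels ℓ).

```python
def margin_loss(scores, gold):
    """Hinge loss over candidate scores (dict: candidate -> score).
    Zero once the gold score beats every other score by the margin 1."""
    best_wrong = max(v for cand, v in scores.items() if cand != gold)
    return max(0.0, 1.0 + best_wrong - scores[gold])
```

Summing this loss over all splitting points of the training trees gives the quantity minimized during optimization.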

Two-stage Parser as Teacher Parser
Since the student parser still has room for improvement in Relation, it is desirable to utilize another state-of-the-art parser, based on a different parsing algorithm, with a good Relation score. While the current best Relation score was achieved by NNDisParser (Yu et al., 2018), we could not reproduce this score with their official code. Therefore, we employ the two-stage parser (Wang et al., 2017b), which obtained the second-best Relation score, as the teacher parser. The two-stage parser is based on a shift-reduce parsing algorithm and uses SVMs to determine the actions that build trees. Since the SVMs are optimized by a dual coordinate descent method, we can build multiple two-stage parser models with different seeds; we do so to obtain sufficient agreement between teacher parsers and create silver data from the agreement among them.

Datasets
We used the RST-DT to evaluate the performance of our student RST parser and compared it with state-of-the-art parsers. The corpus is officially divided into 347 training documents and 38 test documents. Since there is no development dataset, we used 40 documents from the training dataset as the development dataset, following a previous study (Heilman and Sagae, 2015). Following conventional studies, we used gold EDU segmentation for the RST-DT. The training and development datasets were used as gold data to fine-tune our student parser. To obtain silver data for pre-training, we used the CNN dataset (Hermann et al., 2015). To parse each document, we split sentences into EDUs with the Neural EDU Segmenter (Wang et al., 2018) and applied the two-stage parser. Note that we could not use multiple different parser architectures as teachers, for example, the two-stage parser and the span-based parser, because their agreement was low.

Settings
l_min and l_max for AST extraction: Since the number of EDUs per document in the RST-DT ranges from 7 to 240, we selected l_min from the range 5 to 10 and set l_max to 240. Based on the results for the development dataset, l_min was fixed at 9 (see Appendix A for details). Student parser: We used the official code of the span-based neural top-down parsing method. The dimension of the hidden layers was set to 500. We trained the model for 5 and 10 epochs in pre-training and fine-tuning, respectively. Other parameters of the model and the optimizer were the same as those used by Kobayashi et al. (2020) (see Appendix E for details). Kobayashi et al. (2020) achieved their best results in the D2P2S2E setting, training models at three levels of granularity, i.e., paragraph trees for documents, sentence trees for paragraphs, and EDU trees for sentences. This setting requires training many models corresponding to the multiple granularity levels. To simplify this, we trained only the model that builds an RST tree whose leaves are EDUs for a document, which corresponds to their D2E setting. In decoding, we split spans at sentence and paragraph boundaries to make the setting closer to D2P2S2E.
We also used ensemble decoding, following Kobayashi et al. (2020). Since it takes a large amount of time to train multiple models in pre-training, we trained only a single model in the pre-training stage, while multiple models were trained in the fine-tuning stage with the pre-trained model as the initial state.

Table 2: Micro-averaged F1 scores of the span-based neural top-down parser with or without silver data on the test dataset of the RST-DT. S, N, R, and F represent Span, Nuclearity, Relation, and Full scores, respectively. The best score for each metric in the average and ensemble settings is indicated in bold.

Teacher parser: We used the official code of the two-stage parsing method and re-trained it four times with different random seeds. A smaller value of k makes the agreement less reliable, since coincidentally agreeing trees cannot be excluded; a larger value makes the agreement more reliable but requires more time to create the silver data. We therefore set k to 4, a moderate value in terms of both the reliability of the agreement and the data creation time.

Evaluation Metrics
Following previous studies (Sagae and Lavie, 2005), we transformed RST trees into right-heavy binary trees and evaluated system results with micro-averaged F1 scores for Span, Nuclearity, Relation, and Full, based on RST-Parseval (Marcu, 2000). Span, Nuclearity, Relation, and Full evaluate unlabeled, nuclearity-labeled, relation-labeled, and fully-labeled tree structures, respectively. Since Morey et al. (2017) suggested using the original Parseval for evaluation, we also report results using it in Appendix C.
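For concreteness, micro-averaged F1 over labeled spans can be computed as below. This is a sketch that reduces each tree to a set of (start, end, label) tuples pooled over all test documents; it mirrors the span-matching idea behind the metric but is not the official scorer.

```python
def micro_f1(gold_spans, pred_spans):
    """Micro-averaged F1 between two sets of labeled spans,
    e.g. (i, j, nuclearity) tuples pooled over all test documents."""
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Changing which fields go into the tuples (none, nuclearity, relation, or both) yields the Span, Nuclearity, Relation, and Full variants, respectively.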

Compared Methods
To demonstrate the effectiveness of our proposed method, we pre-trained the span-based neural top-down parser, i.e., our student parser, in various settings for creating the silver data and compared the performance after fine-tuning on the RST-DT. (Results when using SBP as a teacher parser are shown in Appendix B. The official code of the two-stage parser is available at https://github.com/yizhongw/StageDP.) Table 1 summarizes the statistics of the different types of silver data. 'DT' denotes RST trees obtained by using a single two-stage parser; the number of RST trees equals the number of documents in the CNN dataset. 'ADT' denotes agreement document-level RST trees, i.e., cases in which the parsers built identical trees for a whole document. 'AST' denotes the ASTs extracted from the RST trees produced by the teacher parsers. Table 2 shows the average and ensemble scores over five models for the different types of silver data. In the table, SBP indicates the results of the original span-based neural top-down parser trained only on the RST-DT, i.e., without any silver data. With AST as the silver data, performance on all metrics improved over this baseline, and AST achieved the best scores on most metrics. In particular, the gains in Relation and Full were impressive. DT and ADT, which consist of document-level RST trees, also outperformed the baseline, but their gains were smaller than those of AST. We believe this is related to the size and quality of the silver data. ADT contains only 2,142 trees and 57,940 nodes, while AST has 175,709 trees and 2,279,275 nodes; a small amount of silver data for pre-training is not effective. On the other hand, while DT has only 91,536 trees, the number of nodes is huge, at about 8,000K. The lower score of DT likely comes from unreliable parse trees contained in silver data built by a single teacher parser.
As described above, we do not need entire RST trees for documents to pre-train the student parser. Thus AST, with a large collection of RST subtrees, is more effective than the other approaches. Since the training time depends on the number of nodes in the data, SBP+AST can be trained in a quarter of the time required by SBP+DT. Consequently, AST has another advantage over DT.

Different Methods for Constructing Silver Data
Furthermore, the performance of averaging five models was greatly improved by pre-training with the silver data. The gains against the baseline were larger than those in the 'Ensemble' setting, and the differences between the two settings became small. A neural model tends to converge to a different local optimum under mini-batch training, so convergence is unstable when the data size is small; pre-training with silver data mitigates this, which is another advantage of our approach.
We also compare the results of our parser pretrained with AST with and without fine-tuning in Appendix D.

Effect of Data Size
To investigate how the size of the AST data used for pre-training affects performance, Figure 3 shows Span, Nuclearity, Relation, and Full scores while varying the data size. Span scores showed only small gains even as the amount of data increased, because identifying splitting points for spans is a simple 2-class classification problem. On the other hand, identifying nuclearity and relation labels is a multi-class classification problem, so we believe more training data is needed than for identifying splitting points. In particular, the Relation score could be improved further with more silver data.

Detailed Analysis of Relation Labeling
To investigate the effectiveness of SBP+AST in more detail, Figure 4 shows Relation F1 scores per relation label for SBP, SBP+AST, and the two-stage parser. The results of SBP and SBP+AST were obtained from a five-model ensemble. For most relation labels, the two-stage parser, i.e., the teacher, is comparable or superior to SBP, i.e., the student, so the performance of SBP+AST can improve. With pre-training on silver data, SBP+AST finally outperformed the two-stage parser, even for less frequent relation labels. Furthermore, SBP+AST can correctly parse some relation labels that the student parser alone cannot handle, by acquiring training instances with the help of the teacher parser.

Table 3: Comparison of state-of-the-art parsers. * indicates reported scores. The best score in each metric is indicated in bold. Our model is statistically significantly better than underlined scores at p < 0.01 in pairwise comparison.

Comparison with State-of-the-art Parsers
Finally, we compare our SBP+AST with ensemble decoding to current state-of-the-art parsers. Table 3 shows the micro-averaged F1 scores. We used paired bootstrap resampling (Koehn, 2004) for the significance test. Our method achieved the best scores on all metrics except Span. The gains over the previous best scores were 0.4, 3.0, and 2.7 points for Nuclearity, Relation, and Full, respectively. In particular, the gains for Relation and Full are remarkable.

Conclusion
To solve the problem of the limited amount of training data available for neural RST parsing, we proposed a method that exploits agreement subtrees as silver data: we pre-train a parser with the silver data and fine-tune it with the gold data. We also presented an algorithm that efficiently extracts common subtrees from multiple trees as the agreement subtrees. Experimental results on the RST-DT demonstrated that our method significantly improves relation-labeled and fully-labeled F1 scores, which are strongly affected by data sparseness due to the small amount of training data. Furthermore, the results showed that our method achieves state-of-the-art nuclearity-labeled, relation-labeled, and fully-labeled F1 scores.

A Effects of Parameter l_min

Figure 5 shows fully-labeled F1 scores on the development dataset when changing l_min. The figure clearly shows that the F1 score varies with l_min, and the best score was achieved at l_min = 9. The results indicate that a large number of small subtrees prevents better pre-training of SBP, while a small number of large subtrees is also not useful for pre-training.
B Performance When the Span-based Parser Is Used as Both Teacher and Student

We used different parsers for the teacher and the student. In this section, we examine the setting of using the span-based parser as both the teacher and the student parser. We compare DT (SBP) and DT (TSP) as the silver datasets used for pre-training and show the results in Table 4. The results show that SBP+DT (SBP) obtains no performance gain over SBP, which does not use any silver dataset. The better results with SBP+DT (TSP) demonstrate the effectiveness of using different types of parsers for the teacher and the student.

C Performance with Original Parseval
In this paper, we used gold EDU segmentation following conventional studies and evaluated model performance on binarized trees with RST-Parseval (Marcu, 2000). Morey et al. (2017) reported that evaluating binarized trees over manual EDU segmentation artificially raises the level of agreement between RST trees. To avoid this, they recommended using the original Parseval on the trees of labeled attachment decisions. Following them, we evaluated our models with the original Parseval and show the results in Table 5.
The results show the same tendency as when employing RST-Parseval as the evaluation metric; that is, SBP+AST obtained the best results for Nuclearity, Relation, and Full.

Table 5: Results with the original Parseval (Morey et al., 2017). * indicates the scores reported in Morey et al. (2017). ** indicates the scores for the reproduced models.

D Performance of Parser with Pre-training Alone
In our method, we applied both pre-training and fine-tuning to the target neural parser because this is the conventional way to improve neural network-based models. However, this differs from the usual way of re-training models based on traditional supervised learning in a semi-supervised fashion. To investigate whether the approach with both pre-training and fine-tuning is effective, we compared it with other training methods, specifically pre-training alone with the CNN dataset and with both the CNN dataset and the RST-DT; the comparison results are shown in Table 6. The scores for Nuclearity, Relation, and Full were not statistically significantly different from each other, which indicates that the difference between the two methods is minimal.
Furthermore, compared with the performance of the two-stage parser, i.e., our teacher parser, it is confirmed that our silver data is of adequate quality. Compared with the fine-tuned models, it is also confirmed that fine-tuning improves performance.