Multilingual Neural RST Discourse Parsing

Text discourse parsing plays an important role in understanding information flow and argumentative structure in natural language. Previous research under Rhetorical Structure Theory (RST) has mostly focused on inducing and evaluating models from the English treebank. However, the parsing tasks for other languages such as German, Dutch, and Portuguese remain challenging due to the shortage of annotated data. In this work, we investigate two approaches to establish a neural, cross-lingual discourse parser: (1) utilizing multilingual vector representations; and (2) adopting segment-level translation of the source content. Experimental results show that both methods are effective even with limited training data, and achieve state-of-the-art performance on cross-lingual, document-level discourse parsing on all sub-tasks.


Introduction
Rhetorical Structure Theory (RST) (Mann and Thompson, 1988) is one of the most influential theories of discourse analysis, under which a document is represented by a hierarchical discourse tree. As shown in Figure 1a, the leaf nodes of an RST tree are text spans named Elementary Discourse Units (EDUs), and the EDUs are connected by rhetorical relations (e.g., Cause, Contrast) to form larger text spans until the entire document is included. The rhetorical relations are further categorized into Nucleus (core part) and Satellite (subordinate part) based on their relative importance. Thus, document-level discourse parsing consists of three sub-tasks: tree construction, nuclearity determination, and relation classification. Moreover, downstream natural language processing tasks can benefit from RST-based structure-aware document analysis, such as summarization (Liu and Chen, 2019; Xu et al., 2020) and machine comprehension (Gao et al., 2020).
By utilizing various linguistic characteristics (e.g., N-gram and lexical features, syntactic and organizational features), statistical approaches have obtained substantial improvement on the English RST-DT benchmark (Sagae, 2009; Hernault et al., 2010; Joty et al., 2013; Li et al., 2014b; Heilman and Sagae, 2015). Recently, neural networks have been making inroads into discourse analysis frameworks, such as attention-based hierarchical encoding (Li et al., 2016) and integrating neural-based syntactic features into a transition-based parser (Yu et al., 2018). Lin et al. (2019) and their follow-up work successfully explored encoder-decoder neural architectures for sentence-level discourse analysis, with a top-down parsing procedure.
Although discourse parsing has received much research attention and progress, models are mainly optimized and evaluated in English. The main challenge is the shortage of annotated data, since manual annotation under the RST framework is labor-intensive and requires specialized linguistic knowledge. For instance, the most popular benchmark, the English RST-DT corpus (Carlson et al., 2002), contains only 385 samples, much smaller than the datasets of other natural language processing tasks. The treebank sizes of other languages such as German (Stede and Neumann, 2014), Dutch (Redeker et al., 2012) and Basque (Iruskieta et al., 2013) are even more limited. Such limitations make it difficult to achieve, on these languages, the performance required to fully support downstream tasks, and also lead to poor generalization of computational approaches.
Since the treebanks of different languages share the same underlying linguistic theory, data-driven approaches can benefit from joint learning on multilingual RST resources (Braud et al., 2017). Therefore, in this paper, we investigate two methods to build a cross-lingual neural discourse parser: (1) from the embedding perspective, with cross-lingual contextualized language models, we can train a parser on a shared semantic space from multilingual sources without employing a language indicator; (2) from the text perspective, since each EDU is a semantically cohesive unit, we can unify the target language space by EDU-level translation, while preserving the original EDU segmentation and the discourse tree structures (see Figure 1c). To this end, we adapted and enhanced an end-to-end neural discourse parser, and investigated the two proposed approaches on 6 different languages. While the available RST training data remain small in scale, we achieved state-of-the-art performance on all fronts, significantly surpassing previous models and even approaching the upper bound of human performance. Moreover, we conducted a topic modeling analysis on the collected multilingual treebanks to evaluate model generality across domains.

Neural Discourse Parser
Since the encoder-decoder neural architecture with a top-down parsing procedure (Lin et al., 2019) has achieved impressive performance on sentence-level discourse analysis, we adapted and enhanced it for the document-level parsing task. The neural model consists of an encoder, a span splitting decoder, and a nuclearity-relation classifier.

Encoder: The encoder produces EDU-level representations via a hierarchical encoding process. Given a document containing n tokens, an embedding layer is used to obtain token-level representations T = {t_1, ..., t_n}. We then obtain EDU-level representations by averaging the token embeddings of each EDU, and feed them to a Bi-GRU (Cho et al., 2014) component for document-level context-aware modeling. Moreover, to exploit implicit syntactic information such as part-of-speech (POS) tags and sentential information (Yu et al., 2018), we incorporate the boundary embeddings at both ends of each EDU from T into the context-aware vectors, and obtain the final EDU representations E = {e_1, ..., e_m}, where m is the total number of EDUs.

Span Splitting Decoder: The decoder splits spans of EDUs to form the tree structure in a top-down, transition-based procedure. Figure 2 illustrates the parsing steps for the example in Figure 1: the decoder maintains a Stack, initialized with the span of all EDUs e_1:m. At each decoding step t, the span e_i:j at the head of the stack is split into two sub-spans e_i:k and e_k+1:j (i ≤ k < j), where k is the splitting position predicted by an attention-based pointer network (Bahdanau et al., 2015; Vinyals et al., 2015). Afterwards, sub-spans containing more than one EDU are pushed onto the stack, and the decoder iteratively parses spans until the Stack is empty.

Figure 2: Document-level neural parser. t, e and h denote input tokens, encoded EDU representations, and decoded hidden states. The Stack is maintained by the decoder to track top-down depth-first span splitting.
With each splitting pointer k, sub-spans e i:k and e k+1:j are fed to a classifier Φ for nuclearity and relation determination.
Nuclearity-Relation Classifier: At each decoding step, after the span e_i:j is split into two sub-spans e_i:k and e_k+1:j, a bi-affine classifier (Dozat and Manning, 2017) is adopted to predict their nuclearity and relation labels. Here we use joint nuclearity-relation labels, as in previous studies (Yu et al., 2018; Lin et al., 2019). The total loss is the sum of the cross entropy of span splitting and that of nuclearity-relation classification. Model implementation details and hyper-parameter configuration are described in Appendix A.
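As a sketch, the top-down depth-first splitting loop can be written in a few lines; `split_fn` and `label_fn` are hypothetical stand-ins for the trained pointer network and nuclearity-relation classifier, and EDU indices are 0-based:

```python
def parse_top_down(num_edus, split_fn, label_fn):
    """Top-down, depth-first RST parsing with an explicit stack.

    split_fn(i, j)    -> split position k with i <= k < j (pointer network)
    label_fn(i, k, j) -> joint nuclearity-relation label  (bi-affine classifier)
    Returns a list of (i, k, j, label) split decisions.
    """
    stack = [(0, num_edus - 1)]          # initialized with the full span e_1:m
    decisions = []
    while stack:
        i, j = stack.pop()
        k = split_fn(i, j)               # predicted splitting position
        assert i <= k < j, "split must fall strictly inside the span"
        decisions.append((i, k, j, label_fn(i, k, j)))
        # Push right sub-span first so the left one is parsed next
        # (depth-first, left-to-right); single-EDU spans are not pushed.
        if k + 1 < j:
            stack.append((k + 1, j))
        if i < k:
            stack.append((i, k))
    return decisions
```

With a midpoint splitter and a constant label, a 4-EDU document yields three splits in depth-first, left-to-right order, mirroring the procedure in Figure 2.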

Multilingual Parsing
In this section, we introduce two approaches for multilingual discourse parsing. Since both methods are model-independent, they can be adopted with various neural architectures and extended to other low-resource scenarios.

Utilizing Cross-Lingual Vector Representations
Recently, large-scale multilingual language models have provided universal encoders that project inputs from various languages into a shared embedding space (Conneau and Lample, 2019), and have proven effective in natural language processing tasks such as machine comprehension. Therefore, to parse documents from various languages, we first propose to apply a cross-lingual representation backbone in the embedding layer of the parser in Section 2.1. Here, we utilize XLM-RoBERTa (Conneau et al., 2020), which supports 100 languages, and fine-tune it jointly with the whole neural parser. Moreover, since BERT-based backbones have a positional embedding length limit, we adopt a sliding window strategy to encode lengthy sequences without truncation for document-level discourse parsing, enabling better long-dependency modeling.
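As a sketch, the sliding-window boundaries (using the window size 500 and stride 200 reported in Appendix A.4) can be computed as below; the token representations in overlapping regions would then be merged, e.g. by averaging:

```python
def sliding_windows(n_tokens, window=500, stride=200):
    """Return (start, end) token spans covering the whole document.

    Consecutive windows overlap by (window - stride) tokens, so every
    token is encoded with context on at least one side.
    """
    if n_tokens <= window:
        return [(0, n_tokens)]
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:        # last window reaches the document end
            break
        start += stride
    return spans
```

Each window is encoded independently by the backbone; because every token falls in at least one window, no truncation is needed regardless of document length.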

Adopting EDU Segment-Level Translation
Aside from using cross-lingual embeddings, an alternative is to transform the multilingual text content into a monolingual space. While sophisticated neural approaches are able to generate multilingual translations with high quality and fluency, the commonly adopted sentence-level translation usually changes the syntactic structure, which affects the original discourse annotation, such as the number and order of EDUs. Therefore, we propose to convert multilingual source content via EDU segment-level translation, as EDU segments are deemed to be semantically cohesive. We feed the documents with EDU segmentation (split by newlines) to a machine translator (Wu et al., 2016), then use the monolingual samples for training and evaluation. As shown in Figure 1, we observe that the translated material retains the original split and order at the EDU level, and shares the same English syntactic characteristics such as the position of discourse connectives (e.g., 'however', 'although') and relative pronouns (e.g., 'that', 'which'). We then train the neural parser of Section 2.1 on the translated samples with their original tree structure, nuclearity, and relation annotations.

Table 2: Evaluation scores on multilingual RST treebanks. * denotes results from (Braud et al., 2017). Sp, Nu and Rel denote span splitting, nuclearity and relation determination respectively.
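A minimal sketch of the segment-level translation step, assuming EDUs are newline-separated as described above; `translate` is a hypothetical stand-in for any segment-level translation function (the paper uses the system of Wu et al., 2016):

```python
def translate_edus(document, translate):
    """Translate a document EDU-by-EDU, preserving segmentation.

    `document` holds one EDU per line; each EDU is translated
    independently, so the number and order of EDUs (and hence the
    discourse tree annotation over them) are unchanged.
    """
    edus = document.split("\n")
    return "\n".join(translate(edu) for edu in edus)
```

Because each line is translated in isolation, the output aligns EDU-for-EDU with the source, letting the original tree structure, nuclearity, and relation labels be reused directly.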

Experimental Results and Analyses
In this section, we describe data collection and present the experimental setting, results and analyses of the proposed methods.

Data and Pre-processing
We constructed a multilingual dataset by collecting treebanks for 6 languages: English (En-DT), Brazilian Portuguese (Pt-DT), Spanish (Es-DT), German (De-DT), Dutch (Nl-DT), and Basque (Eu-DT); their details are shown in Table 1. Since the annotation formats differ slightly among treebanks, we conducted data pre-processing as in (Braud et al., 2017) to unify them. All samples were transformed into binary trees, and units not linked to others within the tree were removed. Following prior work, we reorganized the discourse relations into 18 categories, and attached the nuclearity labels Nucleus-Satellite (NS), Satellite-Nucleus (SN) and Nucleus-Nucleus (NN) to the relation labels. For each language, we randomly selected 38 samples for evaluation; in total, the training and test sets contain 1.1k and 228 samples, respectively. For encoding input, we applied the pre-trained sub-word tokenizer of XLM-RoBERTa (Conneau et al., 2020). For each language, we ran multiple trials with different random seeds and report the averaged scores.
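As an illustration of the binarization step, n-ary nodes can be right-binarized; reusing the parent's label for the intermediate node is one common convention and an assumption here, not a detail specified above:

```python
def binarize(node):
    """Right-binarize an n-ary discourse node.

    A node is (label, children); leaves have an empty child list.
    Nodes with more than two children are collapsed from the right,
    with intermediate nodes inheriting the parent label (assumed
    convention for illustration).
    """
    label, children = node
    if not children:
        return (label, [])
    kids = [binarize(c) for c in children]
    while len(kids) > 2:
        right = (label, kids[-2:])       # fold the two rightmost children
        kids = kids[:-2] + [right]
    return (label, kids)
```

After this transformation, every internal node has exactly two children, matching the binary span-splitting decisions made by the decoder.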

Evaluation Result
The experimental results are shown in Table 2. Since macro-averaged and micro-averaged F1 scores are reported in different previous works, we conducted extensive comparisons under both criteria. The results demonstrate that: (1) models trained only on the English treebank (EN-Training) achieve competitive span splitting performance on the multilingual test sets; (2) the two proposed approaches with multilingual training (Multi-Training) surpass the baselines by a significant margin on all fronts: span splitting prediction on all languages approaches human performance, and nuclearity and relation determination are improved substantially compared to previously reported cross-lingual parsers (Braud et al., 2017); (3) interestingly, the model with cross-lingual representations performs slightly better on the treebanks with fewer samples (e.g., De-DT, Nl-DT, and Eu-DT), while the model with segment-level translation obtains the best result in English.
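Since both averaging criteria appear in prior work, the difference is worth making concrete; a minimal implementation of micro- vs. macro-averaged F1 over multi-class predictions:

```python
from collections import Counter

def micro_macro_f1(gold, pred):
    """Return (micro_f1, macro_f1) for parallel label sequences.

    Micro-F1 pools true/false positives across classes (frequent
    classes dominate); macro-F1 averages per-class F1 (each class
    counts equally, so rare relations weigh more).
    """
    labels = set(gold) | set(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    per_class = [f1(tp[l], fp[l], fn[l]) for l in sorted(labels)]
    macro = sum(per_class) / len(per_class)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro
```

On RST relation labels, whose distribution is highly skewed, the two scores can diverge noticeably, which is why Table 2 reports both.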

Topic Modeling Analysis
To further assess the generality of our parsers from the domain perspective, we conducted a topic modeling analysis on the translated samples from the multilingual treebanks. LDA (with the topic number set to 5) and t-SNE were used for topic modeling and feature visualization, respectively. As shown in Figure 3, the treebanks tend to cluster into different topics (marked with circles). For instance, the English treebank (En-DT) mainly covers the financial news domain (in blue). Compared to the Portuguese treebank (Pt-DT), the Spanish one (Es-DT) is more distinct from En-DT, which is consistent with the performance gap between them under EN-Training (see Table 2). By adding Spanish (Es-DT) and Portuguese (Pt-DT) data, the topic coverage of the Multi-Training model expands to scientific and terminology articles, and the model thus generalizes better to other domains.
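A sketch of this analysis pipeline with scikit-learn; the toy corpus and all parameter values except the topic number (5, as stated above) are illustrative, and the actual input is the translated treebank documents:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.manifold import TSNE

def topic_coordinates(docs, n_topics=5, perplexity=2, seed=0):
    """Fit LDA on bag-of-words counts, then project the document-topic
    mixtures to 2-D with t-SNE for visualization."""
    counts = CountVectorizer().fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    mixtures = lda.fit_transform(counts)      # one topic distribution per doc
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="random", random_state=seed)
    return tsne.fit_transform(mixtures)       # (n_docs, 2) coordinates
```

Plotting the returned coordinates, colored per treebank, reproduces the kind of cluster view shown in Figure 3 (t-SNE's `perplexity` must stay below the number of documents, hence the small value for this toy example).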

Conclusion
In this paper, we investigated two approaches for cross-lingual neural discourse parsing. Experimental results show that both utilizing cross-lingual representations and adopting segment-level translation contribute to state-of-the-art performance on various treebanks. Moreover, monolingual models can also benefit from cross-lingual training, which introduces data from more domains. For future work, we consider conducting domain adaptation via few-shot learning to make our approach more generalizable.

A.1 Details of Encoder-Decoder
Given a document containing n tokens, the embedding layer (a pre-trained language model) produces token-level embeddings $T = \{t_1, ..., t_n\}$, and the EDU-level representations $C = \{c_1, ..., c_m\}$ are calculated by averaging the respective token-level embeddings. Then, a multi-layer Bi-GRU is employed to generate the context-aware EDU-level representations $V = \{v_1, ..., v_m\}$ by sequentially modeling the dependency among $C$, where each $v_i$ is the concatenation of the forward and backward hidden states:

$v_i = [\overrightarrow{h}_i \,;\, \overleftarrow{h}_i]$  (1)

Afterwards, the final EDU representations are produced by incorporating the boundary embeddings $t_{s_i}$ and $t_{e_i}$ (the embeddings of the first and last tokens of the $i$-th EDU, taken from $T$) into the context-aware EDU vector $v_i$:

$e_i = W_e [v_i \,;\, t_{s_i} \,;\, t_{e_i}] + b_e$  (2)

where $;$ denotes the concatenation operation, and $W_e$ and $b_e$ are a trainable parameter matrix and bias.

We employ a unidirectional GRU layer for the span splitting decoder, and its hidden state $h_0$ is initialized with the last hidden state of the encoder. At each decoding step, the hidden state $h_t$ is produced by the GRU from the previous hidden state $h_{t-1}$ and the input span representation $e_{i:k}$, where $e_{i:k}$ is the average of the respective EDU representations (i.e., $\mathrm{mean}(e_i, ..., e_k)$ for $e_{i:k}$). Then, the pointer network (Vinyals et al., 2015) predicts the splitting position from attention scores over the encoded EDU representations, computed as a softmax distribution over the input span:

$a^t_u = \dfrac{\exp(\sigma(h_t, e_u))}{\sum_{u'=i}^{j} \exp(\sigma(h_t, e_{u'}))}, \quad i \le u \le j$  (3)

where $\sigma(x, y)$ is the dot product, used as the attention scoring function.
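A NumPy sketch of the dot-product pointer attention described above, assuming `edu_reps` stacks the encoded EDU vectors row-wise and the span covers EDUs `i` to `j-1` (0-based, end-exclusive indexing for illustration):

```python
import numpy as np

def pointer_attention(h_t, edu_reps, i, j):
    """Softmax distribution over candidate split positions in a span.

    h_t      : (d,)   decoder hidden state at step t
    edu_reps : (m, d) encoded EDU representations e_1..e_m
    Returns a (j - i,) probability vector; its argmax is the split.
    """
    scores = edu_reps[i:j] @ h_t            # sigma(h_t, e_u) = dot product
    exp = np.exp(scores - scores.max())     # numerically stable softmax
    return exp / exp.sum()
```

The predicted splitting position is simply the argmax of the returned distribution, restricted to the EDUs inside the current span.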

A.2 Details of Nuclearity-Relation Classifier
After the decoder splits the span $e_{i:j}$ into a left sub-span $e_{i:k}$ and a right sub-span $e_{k+1:j}$, the classifier first projects $e_l$ and $e_r$ (the averages of the respective EDU representations in $e_{i:k}$ and $e_{k+1:j}$) to latent features $e'_l$ and $e'_r$ with a linear layer followed by an Exponential Linear Unit (ELU):

$e'_l = \mathrm{ELU}(W_1 e_l + b_1), \qquad e'_r = \mathrm{ELU}(W_2 e_r + b_2)$  (4)

Then a bi-affine layer with softmax activation maps the features to the nuclearity-relation labels:

$P = \mathrm{softmax}\big({e'_l}^{\top} W_{lr}\, e'_r + W_l^{\top} e'_l + W_r^{\top} e'_r + b\big)$  (5)

where $W_l \in \mathbb{R}^{d \times R}$, $W_r \in \mathbb{R}^{d \times R}$ and $W_{lr} \in \mathbb{R}^{d \times d \times R}$ are the weights, and $b \in \mathbb{R}^R$ is the bias.
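A NumPy sketch of the bi-affine scoring in this classifier, with shapes matching the weight definitions above (d-dimensional sub-span features, R joint nuclearity-relation labels):

```python
import numpy as np

def biaffine_scores(el, er, W_lr, W_l, W_r, b):
    """Bi-affine classifier head with softmax over R joint labels.

    el, er : (d,)       left / right sub-span features
    W_lr   : (d, d, R)  bilinear weight tensor
    W_l    : (d, R)     linear weight for the left feature
    W_r    : (d, R)     linear weight for the right feature
    b      : (R,)       bias
    """
    bilinear = np.einsum("i,ijr,j->r", el, W_lr, er)   # el^T W_lr[..,r] er
    scores = bilinear + W_l.T @ el + W_r.T @ er + b
    exp = np.exp(scores - scores.max())                # stable softmax
    return exp / exp.sum()
```

The predicted joint label is the argmax over the R-dimensional output; the bilinear term lets the score of each label depend on interactions between the two sub-spans, not just on each side separately.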

A.3 Training Loss
The parser's objective contains two parts: building the discourse tree structure and predicting the nuclearity and discourse relation labels. Therefore, the total loss is the sum of structure loss L s and label prediction loss L l , where L s is the cross entropy loss upon attention probabilities of the pointer network and L l is the cross entropy loss of the nuclearity-relation classification.
$\mathcal{L}_s(\theta_s) = -\sum_{t=1}^{M} \log P(y_t \mid y_1, ..., y_{t-1};\, \theta_s)$  (6)

$\mathcal{L}_l(\theta_l) = -\sum_{t=1}^{T} \log P(l_t \mid \theta_l), \quad l_t \in \{1, ..., R\}$  (7)

where $\theta_s$ and $\theta_l$ are the parameters of the pointer network and the classifier respectively, $T$ is the total number of spans, and $y_1, ..., y_{t-1}$ denote the sub-trees generated in previous steps. $M$ is the number of spans that need to be split, and $R$ is the number of nuclearity-relation labels. The total loss with $L_2$-regularization is:

$\mathcal{L}_{total}(\theta_*) = \mathcal{L}_s(\theta_s) + \mathcal{L}_l(\theta_l) + \lambda \lVert \theta_* \rVert_2^2$  (8)

where $\lambda$ is the regularization strength and $\theta_*$ refers to all learnable parameters of the model.
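A small NumPy sketch of the total training objective; the probability vectors stand in for the pointer-network and classifier softmax outputs, and using the weight-decay value from Appendix A.4 for λ is an assumption for illustration:

```python
import numpy as np

def total_loss(split_probs, gold_splits, label_probs, gold_labels,
               params, lam=5e-5):
    """L_total = L_s + L_l + lam * ||theta||^2 (Eq. 8 style sum).

    split_probs / label_probs : lists of softmax distributions
    gold_splits / gold_labels : gold indices into those distributions
    params                    : list of parameter arrays (for the L2 term)
    """
    L_s = -sum(np.log(p[g]) for p, g in zip(split_probs, gold_splits))
    L_l = -sum(np.log(p[g]) for p, g in zip(label_probs, gold_labels))
    L2 = lam * sum(np.sum(w ** 2) for w in params)
    return L_s + L_l + L2
```

Both cross-entropy terms simply pick out the log-probability of the gold decision at each step, so the objective decomposes over the split and label decisions of the tree.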

A.4 Hyper-parameter Configuration
The neural model was implemented in PyTorch (Paszke et al., 2019). We used the 'xlm-roberta-base' model from (Wolf et al., 2019) and fine-tuned its last 4 layers during training. To exploit global contextual information, the window size was set to 500 and the stride size to 200. Documents were tokenized via the sub-word scheme as in (Conneau and Lample, 2019). We trained the model for 30 epochs and selected the best checkpoints on a validation set for evaluation. The Adam optimizer was used with a batch size of 3, weight decay of 5e-5, and learning rate of 1e-4. The dropout rate was set to 0.5 during training. The embedding dimension and hidden size were 768 and 384. The trainable parameter count was 67M, of which 31M came from fine-tuning the language model. All experiments were run on a Tesla V100 GPU with 16GB memory.