Discourse Probing of Pretrained Language Models

Existing work on probing of pretrained language models (LMs) has predominantly focused on sentence-level syntactic tasks. In this paper, we introduce document-level discourse probing to evaluate the ability of pretrained LMs to capture document-level relations. We experiment with 7 pretrained LMs, 4 languages, and 7 discourse probing tasks, and find BART to be the best model overall at capturing discourse, but only in its encoder, with BERT performing surprisingly well given that it is the baseline model. Across the different models, there are substantial differences in which layers best capture discourse information, and large disparities in overall performance.


Introduction
The remarkable development of pretrained language models (Devlin et al., 2019; Lewis et al., 2020; Lan et al., 2020) has raised questions about what precise aspects of language these models do and do not capture. Probing tasks offer a means to perform fine-grained analysis of the capabilities of such models, but most existing work has focused on sentence-level analysis such as syntax (Hewitt and Manning, 2019; Jawahar et al., 2019; de Vries et al., 2020), entities/relations (Papanikolaou et al., 2019), and ontological knowledge (Michael et al., 2020). Less is known about how well such models capture broader discourse in documents.
Rhetorical Structure Theory is a framework for capturing how sentences are connected and describing the overall structure of a document (Mann and Thompson, 1986). A number of studies have used pretrained models to classify discourse markers (Sileo et al., 2019) and discourse relations (Nie et al., 2019; Shi and Demberg, 2019), but few (Koto et al., to appear) have systematically investigated the ability of pretrained models to model discourse structure. Furthermore, existing work relating to discourse probing has typically focused exclusively on the BERT-base model, leaving open the question of how well these findings generalize to other models with different pretraining objectives, for different languages, and for different model sizes.

Table 1: Summary of all English pretrained language models used in this work. "MLM" = masked language model, "NSP" = next sentence prediction, "SOP" = sentence order prediction, "LM" = language model, "DISC" = discriminator, and "DAE" = denoising autoencoder.
Our research question in this paper is: How much discourse structure do layers of different pretrained language models capture, and do the findings generalize across languages?
There are two contemporaneous related studies that have examined discourse modelling in pretrained language models. Upadhye et al. (2020) analyzed how well two pretrained models capture the referential biases of different classes of English verbs. Zhu et al. (2020) applied the model of Feng and Hirst (2014) to parse IMDB documents (Maas et al., 2011) into discourse trees. Using this (potentially noisy) data, probing tasks were conducted by mapping attention layers into single vectors of document-level rhetorical features. These features, however, are unlikely to capture all the intricacies of inter-sentential abstraction, as their input is formed based on discourse relations and aggregate statistics on the distribution of discourse units.

Figure 1: Illustration of the RST discourse probing tasks (Tasks 4-6).
To summarize, we introduce 7 discourse-related probing tasks, which we use to analyze 7 pretrained language models over 4 languages: English, Mandarin Chinese, German, and Spanish. Code and public-domain data associated with this research are available at https://github.com/fajri91/discourse_probing.

Pretrained Language Models
We outline the 7 pretrained models in Table 1: 4 encoder-only models (BERT, RoBERTa, ALBERT, and ELECTRA), 1 decoder-only model (GPT-2), and 2 encoder-decoder models: BART (Lewis et al., 2020) and T5 (Raffel et al., 2019). To reduce the confound of model size, we use pretrained models of similar size (∼110M model parameters), with the exception of ALBERT, which is designed to be lighter weight. All models have 12 transformer layers in total; for BART and T5, this means their encoder and decoder have 6 layers each. Further details of the models are provided in the Supplementary Material.

Probing Tasks for Discourse Coherence
We experiment with a total of seven probing tasks, as detailed below. Tasks 4-6 are component tasks of discourse parsing based on rhetorical structure theory (RST; Mann and Thompson (1986)). In an RST discourse tree, elementary discourse units (EDUs) are typically clauses or sentences, and are hierarchically connected with discourse labels denoting: (1) nuclearity, i.e. nucleus (N) vs. satellite (S); and (2) discourse relations (e.g. elaborate). An example of a binarized RST discourse tree is given in Figure 1.

1. Next sentence prediction. Similar to the next sentence prediction (NSP) objective in BERT pretraining, but here we frame it as a 4-way classification task, with one positive and 3 negative candidates for the next sentence. The preceding context takes the form of between 2 and 8 sentences, but the candidates are always single sentences.
2. Sentence ordering. We shuffle 3-7 sentences and attempt to reproduce the original order. This task is based on Barzilay and Lapata (2008) and Koto et al. (2020), and is assessed based on rank correlation relative to the original order.

Figure 2: Probing task performance on English for each of the seven tasks, plus the average across all tasks. For BART and T5, layers 7-12 are the decoder layers. All results are averaged over three runs, and the vertical line for each data point denotes the standard deviation (noting that most results have low s.d., meaning the bar is often not visible).
3. Discourse connective prediction. For an ordered pair of sentences/clauses, predict the discourse connective that links them, e.g. and, or, or although (Nie et al., 2019), representing the conceptual relation between the sentences/clauses.

4. RST nuclearity prediction. For a given ordered pairing of (potentially complex) EDUs which are connected by an unspecified relation, predict the nucleus/satellite status of each (see Figure 1).

5. RST relation prediction. For a given ordered pairing of (potentially complex) EDUs which are connected by an unspecified relation, predict the relation that holds between them (see Figure 1).

6. EDU segmentation. Segment raw text into EDUs, framed as predicting for each (sub)word whether it marks an EDU boundary.

7. Cloze story test. Given a short story context, select the correct ending from a set of candidate endings (Mostafazadeh et al., 2016).
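To make the task construction concrete, below is a minimal sketch of how a 4-way NSP probing instance (Task 1) can be built; the function name and distractor sampling scheme are illustrative assumptions, not the authors' released code.

# Sketch: build one 4-way NSP instance (one positive, three distractors).
import random

def make_nsp_example(doc_sents, other_docs, n_context=4, n_distractors=3):
    context = doc_sents[:n_context]          # 2-8 preceding sentences
    positive = doc_sents[n_context]          # the true next sentence
    # Distractors are single sentences drawn from unrelated documents.
    pool = [s for d in other_docs for s in d]
    options = random.sample(pool, n_distractors) + [positive]
    random.shuffle(options)
    return {"context": context, "options": options,
            "label": options.index(positive)}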

Experimental Setup
We summarize all data (sources, number of labels, and data splits) in Table 2. This includes English, Chinese, German, and Spanish data for each probing task. For NSP and sentence ordering, we generate data from news articles and Wikipedia. For the RST tasks, we use discourse treebanks for each of the four languages.

We formulate all probing tasks except sentence ordering and EDU segmentation as classification problems, and evaluate using accuracy. During fine-tuning, we add an MLP layer on top of the pretrained model for classification, and only update the MLP parameters (all other layers are frozen). We use the [CLS] embedding for BERT and ALBERT following standard practice, while for other models we perform average pooling to obtain a vector for each sentence, and concatenate these as the input to the MLP (see Appendix F for a comparison of [CLS] vs. average pooling embeddings).

For sentence ordering, we follow Koto et al. (2020) and frame it as a sentence-level sequence labelling task, where the goal is to estimate P(r|s), where r is the rank position and s the sentence. The task has 7 classes, as we have 3-7 sentences (see Section 3). At test time, we choose the label sequence that maximizes the sequence probability. Sentence embeddings are obtained by average pooling. The EDU segmentation task is framed as a binary sequence labelling task (segment boundary or not) at the (sub)word level. We use Spearman rank correlation and macro-averaged F1 score to evaluate sentence ordering and EDU segmentation, respectively.
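As a concrete illustration of this setup, the following is a minimal PyTorch sketch of a layer-wise probe: a frozen pretrained model with a small trainable MLP on top. The checkpoint name, probed layer, and MLP width are illustrative assumptions.

import torch.nn as nn
from transformers import AutoModel

class ProbeClassifier(nn.Module):
    def __init__(self, model_name="bert-base-uncased", layer=8, n_classes=4):
        super().__init__()
        self.lm = AutoModel.from_pretrained(model_name,
                                            output_hidden_states=True)
        for p in self.lm.parameters():       # freeze all pretrained weights
            p.requires_grad = False
        self.layer = layer
        self.mlp = nn.Sequential(
            nn.Linear(self.lm.config.hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes))

    def forward(self, input_ids, attention_mask):
        out = self.lm(input_ids=input_ids, attention_mask=attention_mask)
        hidden = out.hidden_states[self.layer]         # (batch, seq, dim)
        # [CLS] is used for BERT/ALBERT in the paper; here we show the
        # average pooling variant used for the other models.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(1) / mask.sum(1)  # average pooling
        return self.mlp(pooled)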
We use a learning rate of 1e-3, warm-up over 10% of the total steps, and the development set for early stopping in all experiments. All presented results are averaged over three runs; further details of the training configuration are given in the Appendix.
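A sketch of this optimization setup (hyperparameter values from Appendix B; the helper function is our own):

from torch.optim import Adam
from transformers import get_linear_schedule_with_warmup

def build_optimizer(probe, steps_per_epoch, max_epochs=20):
    # Only the MLP parameters are updated; the pretrained LM stays frozen.
    optimizer = Adam(probe.mlp.parameters(), lr=1e-3, eps=1e-8)
    total_steps = steps_per_epoch * max_epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * total_steps),       # 10% warm-up
        num_training_steps=total_steps)
    return optimizer, scheduler
# Early stopping: evaluate on the dev set after each epoch, and stop once
# performance has not improved within the patience window (5-10 epochs).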

Results and Analysis
In Figure 2, we present probing task performance on English for all models, based on representations generated from each of the 12 layers of each model. First, we observe that performance fluctuates (non-monotonically) across layers for most models, except for some models on the NSP task and some ALBERT results on the other probing tasks. We also find that most models except ALBERT have very low standard deviation over the three runs with different random seeds.
We find that all models, with the exception of T5 and the early layers of BERT and ALBERT, perform well on the NSP task, with accuracy ≥ 0.8, implying that it is a simple task. However, all models struggle at sentence ordering (topping out at ρ ∼ 0.4), suggesting that they are ineffective at modelling discourse over multiple sentences; this is borne out in Figure 4, where performance degrades as the number of sentences to re-order increases.
Interestingly, for discourse connective, RST nuclearity, and RST relation prediction, the models produce similar patterns, even though the discourse connective data is derived from a different dataset and is theoretically divorced from RST. BART outperforms most other models in layers 1-6 for these tasks (a similar observation holds for NSP and sentence ordering), with BERT and ALBERT struggling particularly in the earlier layers. For EDU segmentation, RoBERTa and again the first few layers of BART perform best. For the cloze story test, all models improve as we go deeper into the layers, suggesting that high-level story understanding is captured deeper in the models.
We summarize the overall performance by calculating the averaged normalized scores in the last plot in Figure 2. RoBERTa and BART appear to be the best overall models at capturing discourse information, but for BART only in the encoder layers (the first 6 layers). We hypothesize that the BART decoder focuses on sequence generation, and as such is less adept at language understanding. This is supported by a similar trend for T5, also a denoising autoencoder. BERT does surprisingly well (given that it is the baseline model), but mostly in the deeper layers (7-10), while ELECTRA performs best in the last three layers.
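One plausible reading of the averaged normalized score is per-task min-max normalization over all model/layer combinations, followed by averaging over tasks; the exact scheme may differ, so this sketch is an assumption rather than the paper's definition.

import numpy as np

def normalized_average(scores):
    # scores: shape (n_tasks, n_models, n_layers), raw task metrics.
    lo = scores.min(axis=(1, 2), keepdims=True)
    hi = scores.max(axis=(1, 2), keepdims=True)
    normed = (scores - lo) / (hi - lo)       # min-max normalize per task
    return normed.mean(axis=0)               # average over tasks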
In terms of the influence of training data, we see mixed results. BART and RoBERTa are the two best models, and both are trained on more data than most of the other models (an order of magnitude more; see Table 1). But T5 and, to a certain extent, GPT-2 are also trained on more data (T5 in fact has the most training data), yet their discourse modelling performance is underwhelming. In terms of training objectives, it appears that a pure decoder with an LM objective (GPT-2) is less effective at capturing discourse structure. ALBERT, the smallest model (an order of magnitude fewer parameters than most), performs surprisingly well (albeit with high standard deviation), but only at its last layer, suggesting that its discourse knowledge is concentrated deep inside the model.
Lastly, we explore whether these trends hold if we use a larger model (BERT-base vs. BERT-large) and for different languages (again based on monolingual BERT models for the respective languages). Results are presented in Figure 3. For model size ("English (large)" vs. "English"), the overall pattern is remarkably similar, with a slight uplift in absolute results with the larger model. Between the 4 different languages (English, Chinese, German, and Spanish), performance varies for all tasks except NSP (e.g. EDU segmentation appears to be easiest in Chinese, and relation prediction hardest in German), but the shape of the lines is largely the same, indicating that the optimal layers for capturing discourse information are largely consistent across languages.

Conclusion
We perform probing on 7 pretrained language models across 4 languages to investigate what discourse effects they capture. We find that BART's encoder and RoBERTa perform best, while pure language models (GPT-2) struggle. Interestingly, we see a consistent pattern across different languages and model sizes, suggesting that the trends we found are robust across these dimensions.

Table: Huggingface model identifier for each pretrained model used in this work.

B.1 Next Sentence Prediction

We use spaCy (https://spacy.io/) to perform sentence tokenization, and ensure that the distractor options in the training set do not overlap with the test set. For all languages and models, the training configurations are similar: the maximum numbers of tokens in the context and the next sentence are 450 and 50, respectively. If these lengths are exceeded, we truncate the context from the beginning of the sequence, and truncate the next sentence at its end. We concatenate the context with each option, and perform binary classification. Other training configuration details: learning rate = 1e-3, Adam epsilon = 1e-8, maximum gradient norm = 1.0, maximum epochs = 20, warmup = 10% of the training steps, and patience for early stopping = 5 epochs.

The distribution of NSP instances by context length:

#Sentences (context)  Total
2                     2500
4                     2500
6                     2500
8                     2500
Total                 10000

Example options for one NSP instance (1 = correct next sentence):
0: The channel recently said its signal was carried by 22 satellites
0: That step has become a huge challenge for opposition candidates
0: Six men were convicted and then acquitted of the atrocity and no-one has since been convicted of involvement in the bombing
1: A search is continuing for eight people who remain missing.
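A minimal sketch of this preprocessing, assuming the spaCy English pipeline and a Huggingface tokenizer (both model names are illustrative):

import spacy
from transformers import AutoTokenizer

nlp = spacy.load("en_core_web_sm")           # sentence tokenization
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def split_sentences(text):
    return [s.text for s in nlp(text).sents]

def truncate_pair(context, candidate, max_ctx=450, max_next=50):
    # Truncate the context from the beginning (keep its last 450 tokens)
    # and the candidate next sentence at its end (keep its first 50).
    ctx = tokenizer.encode(context, add_special_tokens=False)[-max_ctx:]
    cand = tokenizer.encode(candidate, add_special_tokens=False)[:max_next]
    return ctx, cand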

B.2 Sentence Ordering
In generating the sentence ordering data, we once again use spaCy (https://spacy.io/) to perform sentence tokenization. For all languages and models, the training configurations are similar, with the maximum number of tokens in each sentence = 50, learning rate = 1e-3, Adam epsilon = 1e-8, maximum gradient norm = 1.0, maximum epochs = 20, warmup = 10% of the training steps, and patience for early stopping = 10 epochs.

Example sentence ordering instance (correct order: 2-0-1):
s0: West Mercia Police said the police do not encourage members of the public to pursue their own investigations.
s1: David John Poole, from Hereford, poses online as a 14-year-old girl and says he has been sent hundreds of explicit messages.
s2: He says his work has led to two arrests in four weeks.
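At test time, the label sequence maximizing the sequence probability is selected (see the experimental setup); with at most 7 sentences this can be done by exhaustive search over permutations, as in this sketch:

from itertools import permutations

def best_order(log_probs):
    # log_probs[i][r] = log P(rank r | sentence i); n <= 7, so brute
    # force over all n! permutations is cheap (7! = 5040).
    n = len(log_probs)
    best, best_score = None, float("-inf")
    for perm in permutations(range(n)):      # perm[i] = rank of sentence i
        score = sum(log_probs[i][perm[i]] for i in range(n))
        if score > best_score:
            best, best_score = perm, score
    return best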

B.3 Discourse Connective Prediction
As our Chinese and German data is extracted from discourse treebanks, the number of distinct connective words varies. For instance, in the Chinese discourse treebank, we find 246 unique connective words. To simplify this, we map a connective word to OTHER if its frequency is less than 12. For all languages and models, the training configurations are: maximum token length of each sentence = 50, learning rate = 1e-3, Adam epsilon = 1e-8, maximum gradient norm = 1.0, maximum epochs = 20, warmup = 10% of the training steps, and patience for early stopping = 10 epochs.
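A sketch of the label simplification described above (the helper name is ours):

from collections import Counter

def simplify_connectives(labels, min_freq=12):
    # Collapse connectives seen fewer than min_freq times in the
    # training data into a single OTHER class.
    freq = Counter(labels)
    return [c if freq[c] >= min_freq else "OTHER" for c in labels]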

B.4 RST-related Tasks
In Figures 7 and 8, we present the distribution of the nuclearity and relation labels for the 4 discourse treebanks. The English treebank is significantly larger, and shows a strong preference for the NS (nucleus-satellite) relationship. Unlike the other languages, the Chinese discourse treebank (CDTB) has the highest proportion of NN (nucleus-nucleus) relationships. We also note that the relation label set in CDTB is the simplest, with only 4 labels.
Most of the training details for nuclearity and relation prediction are the same as for the NSP task, except that we set the maximum token length of each sentence to 250. For EDU segmentation, we set the maximum token length of a document to 512.
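A minimal sketch of the EDU segmentation probe as token-level binary labelling, assuming a frozen pretrained model loaded with output_hidden_states=True (names are illustrative):

import torch.nn as nn

class EDUSegmenter(nn.Module):
    def __init__(self, lm, layer=8):
        super().__init__()
        self.lm, self.layer = lm, layer      # frozen pretrained model
        self.classifier = nn.Linear(lm.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        out = self.lm(input_ids=input_ids, attention_mask=attention_mask)
        hidden = out.hidden_states[self.layer]    # (batch, seq<=512, dim)
        return self.classifier(hidden)            # per-token boundary logits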

B.5 Cloze Story Test
As shown in Table 2, we use the cloze story test version-1 (Mostafazadeh et al., 2016). Although version-2 (Sharma et al., 2018) is better in terms of story biases, the gold labels for its test set are not publicly available, which would have limited our ability to explore different layers of a broad range of pretrained language models (due to rate limiting of test evaluation).

F [CLS] vs. Average Pooling in English BERT-base Model
Average pooling generally performs worse than [CLS] embeddings in the last layers of BERT.

Figure 11: Comparison of [CLS] vs. average pooling embeddings for BERT-base across the five tasks for English. Note that sentence ordering and EDU segmentation are excluded, as they are always performed with average pooling embeddings and with sequence labelling at the (sub)word level, respectively.