Examining the rhetorical capacities of neural language models

Recently, neural language models (LMs) have demonstrated impressive abilities in generating high-quality discourse. While many recent papers have analyzed the syntactic aspects encoded in LMs, there has been no analysis to date of their inter-sentential, rhetorical knowledge. In this paper, we propose a method that quantitatively evaluates the rhetorical capacities of neural LMs. We examine how well neural LMs understand the rhetoric of discourse by evaluating their abilities to encode a set of linguistic features derived from Rhetorical Structure Theory (RST). Our experiments show that BERT-based LMs outperform other Transformer LMs, revealing the richer discourse knowledge in their intermediate layer representations. In addition, GPT-2 and XLNet apparently encode less rhetorical knowledge, and we suggest an explanation drawing from linguistic philosophy. Our method shows an avenue towards quantifying the rhetorical capacities of neural LMs.


Introduction
In recent years, neural LMs (especially contextualized LMs) have shown profound abilities to generate texts that can be almost indistinguishable from human writing (Radford et al., 2019). Neural LMs can be used to generate concise summaries (Song et al., 2019), coherent stories (See et al., 2019), and complete documents given prompts (Keskar et al., 2019). It is natural to question the source and extent of their rhetorical knowledge: What makes neural LMs articulate, and how? While some recent works query their linguistic knowledge (Hewitt and Manning, 2019; Liu et al., 2019a; Chen et al., 2019), this open question remains unanswered. We hypothesize that contextualized neural LMs encode rhetorical knowledge in their intermediate representations, and would like to quantify the extent to which they encode rhetorical knowledge. To verify our hypothesis, we hand-craft a set of 24 rhetorical features, including those used to examine the rhetorical capacities of students (Mohsen and Alshahrani, 2019; Liu and Kunnan, 2016; Zhang, 2013; Powers et al., 2001), and evaluate how well neural LMs encode these rhetorical features in the representations they produce while encoding texts.
Recent work has started to evaluate encoded features from hidden representations. Among the available methods, probing (Alain and Bengio, 2017; Adi et al., 2017) has been a popular choice. Previous work probed morphological (Bisazza and Tump, 2018), agreement (Giulianelli et al., 2018), and syntactic features (Hewitt and Manning, 2019; Hewitt and Liang, 2019). Probing involves optimizing a simple projection model from representations to features. The loss of this optimization measures the difficulty of decoding the features from the representations.
In this work, we use a probe containing a self-attention mechanism. We first project the variable-length embeddings to a fixed-length latent representation per document. Then, we apply a simple diagnostic classifier to detect rhetorical features from this latent representation. This design of probe reduces the total number of parameters, and enables us to better understand each model's ability to encode rhetorical knowledge. We find that: • The BERT-based LMs encode more rhetorical features, and in a more stable manner, than other models. • The semantics of non-contextualized embeddings also pertain to some rhetorical features, but less so than most layers of contextualized language models.
These observations allow us to investigate the mechanisms of neural LMs to better understand the degree to which they encode linguistic knowledge. We demonstrate how discourse-level features can be queried and analyzed from neural LMs. All of our code and parsed tree data will be available on GitHub.

Structural analysis of discourse
Various frameworks exist for characterizing "good discourse" (Lawrence and Reed, 2019; Irish and Weiss, 2009; Toulmin, 1958), but most of them are inaccessible to quantitative analysis. In this work, we use Rhetorical Structure Theory (Mann and Thompson, 1988; Mann et al., 1989), since it represents the structure of discourse using trees, allowing straightforward quantitative analysis. There are two components in an RST parse tree: • Each leaf node represents an elementary discourse unit (EDU). The role of an EDU in an article is similar to that of a word in a sentence. • Each non-leaf node denotes a relation involving its two children. Often, one of the children is more dependent on the other and less essential to the writer's purpose. This child is referred to as the "satellite", while the more central child is the "nucleus".
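The two node types above can be captured in a small tree structure. The following is a minimal sketch (the class and field names are our own, not from the paper or any RST parser's API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RSTNode:
    """A node in an RST parse tree.

    A leaf node holds the text of an elementary discourse unit (EDU);
    a non-leaf node holds a relation label and two children, one of
    which is the nucleus (the more central child) and the other the
    satellite.
    """
    relation: Optional[str] = None   # e.g. "Attribution"; None for leaves
    nucleus_first: bool = True       # True -> "NS-..." order, False -> "SN-..."
    children: tuple = ()             # exactly two RSTNode children, or ()
    edu_text: Optional[str] = None   # text span; only set for leaves

    @property
    def is_leaf(self) -> bool:
        return not self.children

# The tree in Figure 1: an SN-Attribution whose satellite comes first.
leaf1 = RSTNode(edu_text="I didn't know this is from C")
leaf2 = RSTNode(edu_text="but it is very good!")
tree = RSTNode(relation="Attribution", nucleus_first=False,
               children=(leaf1, leaf2))
```

All feature extraction in this paper operates on trees of this general shape, as produced by an off-the-shelf RST parser.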

Figure 1: An example RST parse tree for the review sentence "I didn't know this is from C but it is very good!", produced by the parser of Feng and Hirst (2014). Nodes with rectangle borders are discourse relations (here, SN-Attribution), and those without borders are individual EDUs. The "N" and "S" prefixes of discourse relations stand for "nucleus" and "satellite" respectively. Tree representations are clear, easy to understand, and allow us to compute features that numerically depict the rhetorical aspects of documents.

Rhetorical features
Previous work used RST features to analyze the quality of discourse: to assess writing abilities (Wang et al., 2019; Zhang, 2013), to examine linguistic coherence (Feng et al., 2014; Abdalla et al., 2017), and to analyze arguments (Chakrabarty et al., 2019). In this project, we extract similar RST features in the following three categories. Discourse relation occurrences (Sig) We include the number of relations detected in each document. There are 18 relations in this category. Unfortunately, the relation inventories adopted by open-source RST parsers are not unified. To allow for comparison across parsers, we do not differentiate subtle differences between relations, grouping very similar relations together, following the approach of Feng and Hirst (2012) (e.g., we consider both Topic-Shift and Topic-Drift to be Topic-Change). Likewise, we do not differentiate the order of nucleus and satellite (e.g., NS-Evaluation and SN-Evaluation are both considered an Evaluation).
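The grouping described above can be sketched as follows. The `MERGE` table below covers only the example pair given in the text; a full mapping would follow Feng and Hirst (2012), and the exact function names are ours:

```python
from collections import Counter

# Merge near-identical relations; only the example pair from the text
# is listed here (a full table would follow Feng and Hirst, 2012).
MERGE = {"Topic-Shift": "Topic-Change", "Topic-Drift": "Topic-Change"}

def coarse_relation(label: str) -> str:
    """Map a parser relation label to its coarse class.

    Strips the nuclearity prefix ("NS-"/"SN-"/"NN-") so that, e.g.,
    NS-Evaluation and SN-Evaluation both count as Evaluation, then
    merges near-identical relations via MERGE.
    """
    if label[:3] in ("NS-", "SN-", "NN-"):
        label = label.split("-", 1)[1]
    return MERGE.get(label, label)

def sig_features(relation_labels):
    """Count coarse relation occurrences in one document."""
    return Counter(coarse_relation(r) for r in relation_labels)

counts = sig_features(["NS-Evaluation", "SN-Evaluation", "Topic-Shift"])
# counts: Evaluation -> 2, Topic-Change -> 1
```

The resulting 18 counts per document form the Sig feature group.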
Tree property features (Tree) We compute the depth and the Yngve depth (the number of right-branching steps from the root) (Yngve, 1960) of each tree node, and include their mean and variance as characteristic features, following previous work extracting linguistic features from trees (Li et al., 2019; Zhu et al., 2019).
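A sketch of these two tree statistics, using nested pairs as a stand-in for RST trees (the representation and function names are ours; the paper computes them on parser output):

```python
from statistics import mean, pvariance

def node_depths(tree, depth=0, yngve=0, out=None):
    """Collect (depth, Yngve depth) for every node of a binary tree.

    Trees are nested pairs: a leaf is any non-tuple value, an internal
    node is a (left, right) tuple. The Yngve depth of a node is the
    number of right branches on the path from the root (Yngve, 1960).
    """
    if out is None:
        out = []
    out.append((depth, yngve))
    if isinstance(tree, tuple):
        left, right = tree
        node_depths(left, depth + 1, yngve, out)
        node_depths(right, depth + 1, yngve + 1, out)
    return out

def tree_features(tree):
    """Mean and (population) variance of depth and Yngve depth."""
    pairs = node_depths(tree)
    depths = [d for d, _ in pairs]
    yngves = [y for _, y in pairs]
    return mean(depths), pvariance(depths), mean(yngves), pvariance(yngves)

# A 3-EDU tree: (EDU1, (EDU2, EDU3)) has 5 nodes.
feats = tree_features(("e1", ("e2", "e3")))
```

These four numbers per document form the Tree feature group.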
EDU-related features (EDU) We include the mean and variance of the EDU lengths of each document. We hypothesize that longer EDUs indicate higher levels of redundancy in discourse, so that extracting rhetorical features from them requires memory across longer spans.
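The EDU feature group reduces to two summary statistics per document. A minimal sketch (we tokenize by whitespace here for illustration, whereas the paper's token counts come from each LM's own tokenizer):

```python
from statistics import mean, pvariance

def edu_features(edus):
    """Mean and (population) variance of EDU lengths for one document.

    edus is the list of EDU text spans from the document's RST tree;
    lengths are counted in whitespace tokens for this sketch.
    """
    lengths = [len(edu.split()) for edu in edus]
    return mean(lengths), pvariance(lengths)

m, v = edu_features(["I didn't know this is from C", "but it is very good!"])
# lengths are 7 and 5 tokens, so m == 6 and v == 1
```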
Overall, there are 24 features from the three categories. We normalize them to zero mean and unit variance, and use these RST features for probing. The features are not independent of each other; in particular, the features within each group tend to describe the same property from different aspects. For example, the Sig features describe the composition of the document as a histogram: for the same document, if one relation is changed, e.g., from Contrast to Attribution, then the occurrence counts of both Contrast and Attribution are affected.

The 18 relations are: Attribution, Background, Cause, Comparison, Condition, Contrast, Elaboration, Enablement, Evaluation, Explanation, Joint, Manner-Means, Topic-Comment, Summary, Temporal, Topic-Change, Textual-organization, and Same-unit.

Figure 2: RST relation occurrences per document. RST-DT contains longer documents than IMDB on average. However, the distributions of relation frequencies in the two datasets are relatively consistent, with Elaboration, Joint, and Attribution the most frequent signals.
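The normalization to zero mean and unit variance is applied per feature column across the corpus. A minimal sketch (the function name is ours; constant columns are zeroed to avoid division by zero):

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Normalize each feature column to zero mean and unit variance.

    rows is a list of equal-length feature vectors, one per document.
    Columns with zero standard deviation are mapped to 0.0.
    """
    cols = list(zip(*rows))
    stats = [(mean(c), pstdev(c)) for c in cols]
    return [
        [(x - mu) / sd if sd > 0 else 0.0 for x, (mu, sd) in zip(row, stats)]
        for row in rows
    ]

normed = zscore_columns([[1.0, 10.0], [3.0, 10.0]])
# first column becomes [-1.0, 1.0]; the constant column becomes 0.0
```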

Probe
Our probing method contains two weight parameters, W_d and W_p. First, we embed a document with L tokens using a neural LM with D dimensions to get a raw representation matrix X ∈ R^{L×D}. We use a projection matrix W_d ∈ R^{D×d} to reduce the embedding dimension from D (e.g., D = 768 for BERT and 2048 for XLM) to a much smaller one, d. Then, we use self-attention similar to Lin et al. (2017) to collect the information spread across the document into a condensed form:

A = softmax((XW_d)^T (XW_d)) ∈ R^{d×d}

We flatten A into a fixed-size vector Ã ∈ R^{d^2}. We use a probing matrix W_p ∈ R^{d^2×m} to extract RST features v = Ã W_p ∈ R^m from the attention, normalize them to zero mean and unit variance, and optimize based on the expected L2 error:

L = E ||v − v̂||_2^2,

where v̂ denotes the ground-truth RST features. Note that the reduction from D to d using W_d is necessary, because it significantly lowers the number of parameters of the probing model. If there were no W_d (i.e., d = D = 768), then W_p alone would require 768^2 · m parameters to probe m features. With d = 10, W_d and W_p combined have D × d + d^2 · m ≈ 7680 + 100m parameters. Considering m ∈ O(10^1), the total parameter size is reduced from O(10^6) to O(10^3).
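A minimal sketch of the probe's forward pass. The shapes and parameter counts follow the text; the exact parameter-free attention form is our assumption (the original equations are not fully legible), as are all function names:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def probe_forward(X, W_d, W_p):
    """One forward pass of the RST probe (a sketch).

    X   : (L, D) token embeddings from a neural LM
    W_d : (D, d) dimension-reduction matrix
    W_p : (d*d, m) probing matrix
    Returns the m predicted RST features.
    """
    H = X @ W_d                    # (L, d) reduced embeddings
    A = softmax(H.T @ H)           # (d, d) parameter-free self-attention
    A_flat = A.reshape(-1)         # (d*d,) flattened attention
    return A_flat @ W_p            # (m,) predicted features

def probe_loss(v_pred, v_true):
    """L2 error, normalized by the number of features m (Section 2.2)."""
    m = v_true.shape[-1]
    return float(np.sum((v_pred - v_true) ** 2) / m)

rng = np.random.default_rng(0)
L, D, d, m = 50, 768, 10, 24
v = probe_forward(rng.normal(size=(L, D)),
                  rng.normal(size=(D, d)) * 0.01,
                  rng.normal(size=(d * d, m)) * 0.01)
n_params = D * d + d * d * m   # 7680 + 100*24 = 10080, i.e. O(10^3)
```

Only W_d and W_p are trainable, which keeps the probe small enough that good probing losses can be attributed to the LM representations rather than to the probe itself.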
There is one more step before we can use this loss to measure the difficulty of probing rhetorical features. The L2 error scales linearly with the dimension of the features, m, so it is necessary to normalize the L2 error by m, to ensure that losses can be compared across linguistic feature sets. The difficulty of probing a group of m features v ∈ R^m is therefore:

ℓ(v) = (1/m) E ||v − v̂||_2^2

Dataset

IMDB contains 50,000 movie reviews without discourse annotations. In these reviews, the authors explain and elaborate upon their opinions towards certain movies and give ratings. We removed HTML tags and attempted to parse all documents (i.e., both train and test data) using the two-pass parser of Feng and Hirst (2014). We discarded 1,977 documents for which the RST parser generated ill-formatted trees. Of the remaining documents, we additionally filtered out those with sequence lengths greater than 512 tokens (the language models come with their own tokenizers; note that RoBERTa adds two special tokens, so this threshold becomes 510 for RoBERTa), resulting in 40,833 documents. After parsing each document into an RST tree, we extracted the features described in Section 2.1 from the parsed trees. Figure 2 shows the occurrences of the 18 RST relations per document, and Table 1 shows statistics of the remaining 6 features. In addition, we include several examples of parsed RST trees in the Appendix.
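The length filtering step can be sketched as follows (the function name and whitespace tokenizer are ours; in practice each LM's own tokenizer is used):

```python
def filter_by_length(docs, tokenize, max_tokens=512, n_special=0):
    """Keep only documents whose token count fits the LM's input limit.

    tokenize is the LM's own tokenizer; n_special accounts for special
    tokens added by the model, e.g. n_special=2 for RoBERTa's two
    added tokens, giving an effective threshold of 510.
    """
    limit = max_tokens - n_special
    return [doc for doc in docs if len(tokenize(doc)) <= limit]

docs = ["short text", "w " * 600]           # 2 tokens vs. 600 tokens
kept = filter_by_length(docs, str.split, max_tokens=512, n_special=2)
# only "short text" survives the 510-token threshold
```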

Language models
We considered the following popular neural LMs: • BERT_BASE (Devlin et al., 2019) This LM with 110M parameters is built with a 12-layer Transformer encoder (Vaswani et al., 2017) with 768 hidden dimensions. It is trained with masked LM (i.e., cloze) and next-sentence prediction objectives on 16GB of text. • BERT-multi (Wolf et al., 2019) Like BERT, BERT-multi is a 12-layer Transformer encoder with 768 hidden dimensions and 110M parameters. Its difference from BERT is that BERT-multi is trained on the 104 languages with the largest Wikipedias. • RoBERTa (Liu et al., 2019b) is an enhanced version of BERT with the same architecture, similar masked LM objectives, and a 10 times larger training corpus (over 160GB). • XLNet (Yang et al., 2019) is likewise a 12-layer Transformer LM. The XLNet we use is trained on 33GB of text using the "permutation language modeling" objective, with its LM factorization following shuffled orders, but its positional encodings corresponding to the original sequence order. The permutation LM objective introduces diversity and randomness into the context.
To make comparisons between models fair, we limit ourselves to 12-layer neural LMs. The models are pretrained by Huggingface (Wolf et al., 2019).

Implementation
We formulated probing as an optimization problem, and implemented our solution with PyTorch.

How do neural LMs encode RST features? As shown in Figure 3, neural LMs encode RST features in different manners, depending on their structures. In general, for BERT-based models, features seem to be distributed evenly across layers. For GPT-2 and XLNet, lower layers seem to encode slightly more EDU and Sig features than higher layers, whereas Tree features seem to be more concentrated in layers 2-6. The results on XLM are relatively noisy, possibly because the single-language version does not benefit from the performance boost of cross-language modeling.
In contrast with previous work, our results indicate a less definitive localization for discourse features, except for the first and final layers. We suggest that the reason these two layers encode less discourse information is that the first layer focuses on connections between "locations", while the final layer focuses on extracting the representations most relevant to the final pretraining task.
Are RST features equally hard to probe? Figure 3 also shows the difficulty of probing across feature sets. In BERT-based models, EDU and Tree features are comparably easy to probe, whereas the Sig feature group is more challenging. However, GPT-2, XLNet, and XLM do not find EDU or Tree features easier to probe than the other groups. Nevertheless, the results on all features taken together correlate most closely with the Sig features.
How about averaging layers? For comparison, we also used the mean of all 12 layers of each neural LM. Figure 5 shows the probing results. Except for GPT-2, the LMs show similar performance when the representations of the layers are averaged. In addition, the results show that Sig features are harder to probe than Tree and EDU features, whereas the aggregate task (using all features) appears harder than each of its three component feature groups.
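Layer averaging here simply means taking the element-wise mean of the 12 hidden-state matrices per token. A minimal sketch with plain nested lists (the function name is ours):

```python
def average_layers(layer_embeddings):
    """Average the hidden states of all layers, per token position.

    layer_embeddings is a list of same-shaped matrices (one per layer),
    given as nested lists of shape [L][D]; the result is one L x D
    matrix whose entries are the element-wise mean over layers.
    """
    n = len(layer_embeddings)
    L = len(layer_embeddings[0])
    D = len(layer_embeddings[0][0])
    return [[sum(layer[i][j] for layer in layer_embeddings) / n
             for j in range(D)]
            for i in range(L)]

# Two toy "layers" for a 1-token, 2-dimensional model.
avg = average_layers([[[0.0, 2.0]], [[2.0, 4.0]]])
# avg == [[1.0, 3.0]]
```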

Deconstructing the probe
We perform ablation studies to illustrate the effectiveness of probing, deconstructing the language model probe step-by-step. First, we get rid of the contextualization component in language modeling by using non-contextualized word embeddings, GloVe and FastText. Then, we discard the semantic component of word embedding by mapping tokens to randomly generated vectors (RandEmbed). Finally, we remove all information pertaining to the text, leading to a random predictor for RST features, RandGuess.
Non-contextualized word embeddings We consider two popular word embeddings here: • GloVe (Pennington et al., 2014) contains a 2.2M-item vocabulary and produces 300-dimensional word vectors. The GloVe embedding we use is pretrained on Common Crawl.
Word embeddings map each token into a D-dimensional semantic space. Therefore, for a document of length L, the embedded matrix also has shape L × D. The difference from contextualized neural LMs is that the D-dimensional vector of each word does not depend on its context.
Random embeddings In this step, we assign a non-trainable random embedding vector to each token in the vocabulary. This removes the semantic information encoded by the GloVe and FastText word embeddings. As shown in Figures 4 and 5, RandEmbed is worse than GloVe and FastText (except for GloVe on the Sig feature task). This verifies that some semantic information is preserved in word embeddings.
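The RandEmbed baseline can be sketched as follows (the function names and dimensions are ours): each vocabulary item gets one frozen random vector, so repeated occurrences of a token still embed identically, but all learned semantics are gone.

```python
import random

def build_random_embeddings(vocab, dim=300, seed=0):
    """Assign a fixed, non-trainable random vector to each token type.

    Vectors are drawn once per vocabulary item and then frozen, so
    repeated occurrences of a token get the same vector, while the
    semantic structure of GloVe/FastText is removed.
    """
    rng = random.Random(seed)
    return {tok: [rng.gauss(0.0, 1.0) for _ in range(dim)] for tok in vocab}

def embed(tokens, table):
    """Embed a token sequence; the result has shape L x dim."""
    return [table[t] for t in tokens]

table = build_random_embeddings(["the", "movie", "good"], dim=4)
mat = embed(["the", "movie", "the"], table)
```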
Contextualized LMs against baselines First, the lack of context restricts the probing performance of the non-contextualized baselines. They are worse than most layers of the contextualized LMs (Figure 4), and worse than all models except GPT-2 if we average the layers (Figure 5).
Second, it is impossible for any LM to have a "negative" rhetorical capacity. If the probing loss is worse than the RandEmbed baseline, that means the RST probe cannot detect rhetorical features of the given category in the representations. This is what happens in some layers of GPT-2, XLM, and XLNet, and in the mean of all layers of GPT-2.
Random guesser To measure the capacity of the baseline embeddings, we set up a random guesser as a "baseline of baselines". The random guesser outputs the arithmetic mean of the RST features plus small Gaussian noise (with s.d. σ ∈ {0, 0.01, 0.1, 1.0}). The output of RandGuess is completely independent of the discourse. As shown in Table 2, even the best of the four random guessers is much worse than any of the three word embedding baselines, as expected.
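The RandGuess baseline is a constant mean predictor with optional noise. A minimal sketch (the function name is ours):

```python
import random

def rand_guess(train_features, m, sigma, seed=0):
    """Predict the per-feature arithmetic mean plus Gaussian noise.

    Completely independent of the input document. With sigma=0 this is
    the constant mean predictor; sigma in {0, 0.01, 0.1, 1.0} gives the
    four guessers compared in Table 2.
    """
    rng = random.Random(seed)
    means = [sum(row[j] for row in train_features) / len(train_features)
             for j in range(m)]
    return [mu + rng.gauss(0.0, sigma) for mu in means]

# With sigma=0 the guess is exactly the training mean of each feature.
guess = rand_guess([[0.0, 2.0], [2.0, 4.0]], m=2, sigma=0.0)
```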

Why are some LMs better?
From the probing experiments (Figures 3, 4, and 5) we can see that BERT-based LMs have slightly better rhetorical capacities than XLNet, and much better capacities than GPT-2. We present two hypotheses below.

Rhetoric favors contexts from both directions
BERT-based LMs use Transformer encoders, whereas GPT-2 uses Transformer decoders. Their main difference is that a Transformer encoder considers contexts from both the "past" and the "future", while a Transformer decoder only conditions on context from the "past" (Vaswani et al., 2017); GPT-2 thus attends to uni-directional contexts. Plausibly, both the "past" and the "future" contexts contribute to the rhetorical features of words. Without "future" contexts, GPT-2 would encode less rhetorical information.
Random permutation makes encoded rhetoric harder to decode The difference between XLNet and the other LMs is the permutation of context. While permutation increases the diversity in discourse, it can also bring new meanings to the texts. For example, the sentence in Figure 1 ("I didn't know this is from C, but it is very good!") has several syntactically plausible factorization sequences:
• I didn't know C ...
• ... this is C ...
• I know it is very good ...
• I didn't know this is good ...
• ... didn't this C good ...
Such diversity in contexts plausibly makes the upper layers of XLNet contain harder-to-decode rhetorical features. If we average the representations of all layers, XLNet shows a larger variance than the BERT-based LMs. We hypothesize that larger layer-wise differences are a factor in this instability of averaged representations.

Limitations
RST probing is not perfect. While we designed our comparisons to be rigorous, there are still several limitations to the RST probe, described below.
• RST signals are noisy. The RST relation classification task is less well defined than established tasks like POS tagging. Humans tend to disagree with the annotators, resulting in a mere 65.8% accuracy in relation classification (i.e., the task introduced by Marcu (2000)). Accordingly, state-of-the-art discourse parsers currently reach accuracies only slightly higher than 60% (Feng and Hirst, 2014; Ji and Eisenstein, 2014; Wang et al., 2017).

Related work
Recent work has considered the interpretability of contextualized representations. For example, Jain and Wallace (2019) found attention to be uncorrelated with gradient-based feature importance, while Wiegreffe and Pinter (2019) suggested such approaches allowed too much flexibility to give convincing results. Similarly, Serrano et al. (2019) considered attention representations to be noisy indicators of feature importance. Many tasks in argument mining, similar to our task of examining neural LMs, require understanding the rhetorical aspects of discourse (Lawrence and Reed, 2019), which makes RST applicable in related work. For example, RST enables understanding and analyzing the argument structures of monologues (Peldszus and Stede, 2016) and, when used with other discourse features, can improve role labelling in online arguments (Chakrabarty et al., 2019).
Probing neural LMs is an emergent diagnostic task for those models. Previous work probed morphological (Bisazza and Tump, 2018), agreement (Giulianelli et al., 2018), and syntactic (Hewitt and Manning, 2019) features. Hewitt and Liang (2019) compared different probes, and recommended linear probes with as few parameters as possible, in order to reduce overfitting. Recently, Pimentel et al. (2020) argued against this choice from an information-theoretic point of view, and Voita and Titov (2020) presented an optimization goal for probes based on minimum description length. Liu et al. (2019a) proposed 16 diverse probing tasks on top of contextualized LMs, including token labeling (e.g., PoS), segmentation (e.g., NER, grammatical error detection), and pairwise relations. While LMs augmented with a probing layer could reach state-of-the-art performance on many tasks, they found that LMs still lacked fine-grained linguistic knowledge. DiscoEval (Chen et al., 2019) showed that BERT outperformed traditional pretrained sentence encoders in encoding discourse coherence features, which our results echo.

Conclusion
In this paper, we propose a method to quantitatively analyze the amount of rhetorical information encoded in neural language models. We compute features based on Rhetorical Structure Theory (RST) and probe for these RST features in the contextualized representations of neural LMs. Among six popular neural LMs, we find that contextualization generally helps to improve the rhetorical capacities of LMs, while individual models vary in quality. In general, LMs attending to contexts from both directions (BERT-based) encode rhetorical knowledge in a more stable manner than those using uni-directional contexts (GPT-2) or permuted contexts (XLNet).
Our method presents an avenue towards quantitatively describing the rhetorical capacities of neural language models based on unlabeled, target-domain corpora. This method may be used for selecting suitable LMs in tasks including rhetorical act classification, discourse modeling, and response generation.

A Experiments on RST-DT

As a sanity check, we include experiments on the RST-DT (Carlson et al., 2001) corpus with the same preprocessing and feature extraction procedures (i.e., performing feature extraction and embedding at the article level, and ignoring overlength articles). As shown in Figure 6, the BERT family and XLM outperform GPT-2 and XLNet. Also, the non-contextualized embedding baselines generally show worse performance than the contextualized embeddings, with some exceptions (e.g., GPT-2 on EDU features). These results are similar to the IMDB results.
One difference is that the probing losses on RST-DT are generally lower than those in the IMDB experiments. We consider two possible explanations. First, the IMDB signals contain more noise, so probing rhetorical features from IMDB is naturally more difficult than probing from the RST-DT dataset. Second, it is possible that the probes overfit the much smaller RST-DT dataset.

B Examples of parse trees
We include several examples of IMDB parse trees here, including some where the RST parser makes mistakes on a new domain, movie reviews. For clarity of illustration, these examples are among the shorter movie reviews. More parse trees can be generated by our visualization code, which is contained in our submitted scripts.

Figure 8: IMDB train/pos/11857_10.txt. There is an EDU segmentation error: the "I" is incorrectly attached to the previous sentence, "This movie overall was a really good movie". Apparently some lexical cues the EDU segmenter relies on (e.g., a sentence finishing with a period) are not always present in IMDB.

Figure 9: IMDB train/pos/1000_8.txt. The parser captures the key sentence of this review. All sentences following the first one act as reasons explaining why the reviewer liked the film.

Figure 10: IMDB train/pos/10301_8.txt. The interjection "well" is incorrectly identified as the satellite of the Summary signal. This is likely caused by the discrepancy between the training (RST-DT) and test (IMDB) corpora of the RST parser. The RST-DT dataset contains news articles, which are more formal than the online reviews in IMDB, so the term "well" is more likely to be identified there in other senses.

Figure 11: IMDB train/pos/11825_8.txt. One might suggest that the last EDU could be moved one level higher (so that it summarizes the whole review), but this parsing is also reasonable, since the mention of kids elaborates the descriptions of the makeup and the views.

Figure 12: IMDB train/pos/10788_10.txt. This is an example where the EDU segmentation contains a mistake: the "i wish i" should be merged with the subsequent EDU, "could live in big rock candy mountain". Note that the sentence starts with two lowercase "i" (which should be uppercase); non-standard usages like these are characteristic of less formal texts like IMDB.