Evaluating Discourse in Structured Text Representations

Discourse structure is integral to understanding a text and is helpful in many NLP tasks. Learning latent representations of discourse is an attractive alternative to acquiring expensive labeled discourse data. Liu and Lapata (2018) propose a structured attention mechanism for text classification that derives a tree over a text, akin to an RST discourse tree. We examine this model in detail, and evaluate on additional discourse-relevant tasks and datasets, in order to assess whether the structured attention improves performance on the end task and whether it captures a text’s discourse structure. We find the learned latent trees have little to no structure and instead focus on lexical cues; even after obtaining more structured trees with proposed model modifications, the trees are still far from capturing discourse structure when compared to discourse dependency trees from an existing discourse parser. Finally, ablation studies show the structured attention provides little benefit, sometimes even hurting performance.


Introduction
Discourse describes how a document is organized, and how discourse units are rhetorically connected to each other. Taking this structure into account has been shown to help many NLP end tasks, including summarization (Hirao et al., 2013; Durrett et al., 2016), machine translation (Joty et al., 2017), and sentiment analysis (Ji and Smith, 2017). However, annotating discourse requires considerable effort by trained experts and may not always yield a structure appropriate for the end task. As a result, having a model induce the discourse structure of a text is an attractive option. Our goal in this paper is to evaluate such an induced structure.
Inducing structure has recently been a popular approach in syntax (Yogatama et al., 2017; Choi et al., 2018; Bisk and Tran, 2018). Evaluations of these latent trees have shown that they are inconsistent, shallower than their explicitly parsed counterparts (Penn Treebank parses), and do not resemble any linguistic syntax theory (Williams et al., 2018).
For discourse, Liu and Lapata (2018) (L&L) induce a document-level structure while performing text classification with a structured attention that is constrained to resolve to a non-projective dependency tree. We evaluate the document-level structure induced by this model. In order to compare the induced structure to existing linguistically motivated structures, we choose Rhetorical Structure Theory (RST) (Mann and Thompson, 1988), a widely used framework for discourse structure, because it also produces tree-shaped structures. We evaluate on some of the same tasks as L&L, but add two more tasks we theorize to be more discourse-sensitive: text classification of writing quality, and sentence order discrimination (as proposed by Barzilay and Lapata (2008)).
Our research uncovers multiple negative results. We find that, contrary to L&L, the structured attention does not help performance in most cases; further, the model is not learning discourse. Instead, the model learns trees with little to no structure that are heavily influenced by lexical cues to the task. In an effort to induce better trees, we propose several principled modifications to the model, some of which yield more structured trees. However, even the more structured trees bear little resemblance to ground truth RST trees. We conclude the model holds promise, but requires moving beyond text classification, and injecting supervision (as in Strubell et al. (2018)).
Figure 1: Model of Liu and Lapata (2018) with the document-level portion (right) that composes sentences into a document representation.
Our contributions are (1) comprehensive performance results on existing and additional tasks and datasets showing that document-level structured attention is largely unhelpful, (2) in-depth analyses of induced trees showing that they do not represent discourse, and (3) several principled model changes that produce better structures but still do not resemble the structure of discourse.

Rhetorical Structure Theory (RST)
In RST, coherent texts consist of minimal units, which are linked to each other, recursively, through rhetorical relations (Mann and Thompson, 1988). Thus, the goal of RST is to describe the rhetorical organization of a text by using a hierarchical tree structure that captures the communicative intent of the writer. An RST discourse tree can further be represented as a discourse dependency tree. We follow the algorithm of Hirao et al. (2013) to create an unlabelled dependency tree based on the nuclearity of the tree.
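As a rough illustration of how nuclearity can drive such a conversion, the sketch below uses a simplified head-percolation rule: the head EDU of a span is the head of its leftmost nucleus, and the heads of the remaining children attach to it. This is an assumption-laden simplification and not a faithful reimplementation of Hirao et al. (2013), whose procedure may treat details such as multinuclear relations differently. The `Span` class and `to_dependencies` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Span:
    """A node of a (hypothetical) RST constituency tree."""
    edu: Optional[int] = None  # EDU index for leaves, None for internal nodes
    # Each child is paired with its nuclearity: "N" (nucleus) or "S" (satellite).
    children: List[Tuple["Span", str]] = field(default_factory=list)


def to_dependencies(root: Span) -> Dict[int, int]:
    """Map each EDU to its parent EDU; the root EDU maps to -1."""
    parents: Dict[int, int] = {}

    def head_of(span: Span) -> int:
        if span.edu is not None:
            return span.edu
        child_heads = [(head_of(child), role) for child, role in span.children]
        # Simplifying assumption: the leftmost nucleus provides the span's head.
        head = next(h for h, role in child_heads if role == "N")
        for h, _ in child_heads:
            if h != head:
                parents[h] = head  # every other child head attaches to it
        return head

    parents[head_of(root)] = -1
    return parents


# Toy tree: [[e0 (N), e1 (S)] (N), e2 (S)]
toy = Span(children=[
    (Span(children=[(Span(edu=0), "N"), (Span(edu=1), "S")]), "N"),
    (Span(edu=2), "S"),
])
toy_parents = to_dependencies(toy)
```

In the toy tree, both satellites end up attached to EDU 0, which becomes the root of the unlabelled dependency tree.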

Models
We present two models: one for text classification, and one for sentence ordering. Both are based on the L&L model, with a design change to cause stronger percolation of information up the tree (we also experiment without this change).
Text classification The left-hand side of Figure 1 presents an overview of the model: the model operates first at the sentence level to create sentence representations, and then at the document level to create a document representation from the previously created sentence representations. In more detail, the model composes GloVe embeddings (Pennington et al., 2014) into a sentence representation using structured attention (from which a tree can be derived), then sentence representations into a single document representation for class prediction. At both sentence and document level, each object (word or sentence, respectively) attends to other objects that could be its parent in the tree. Since the sentence-level and document-level parts of the model are identical, we focus on the document level (Figure 1, right), which is of interest to us for evaluating discourse effects.
Sentence representations $s_1, \dots, s_t$ are fed to a bidirectional LSTM, and each hidden representation in $[h_1, \dots, h_t]$ consists of a semantic part ($e_t$) and a structure part ($d_t$): $[e_t, d_t] = h_t$. Unnormalized scores $f_{ij}$ representing potentials between parent $i$ and child $j$ are calculated using a bilinear function over the structure vectors. The matrix-tree theorem allows us to compute marginal probabilities $a_{ij}$ of dependency arcs under the distribution over non-projective dependency trees induced by $f_{ij}$ (details in Koo et al. (2007)). This computation is fully differentiable, allowing it to be treated as another neural network layer in the model. Importantly, the model uses only the marginals. Post hoc, we can use the Chu-Liu-Edmonds algorithm to retrieve the highest-scoring tree under $f$, which we call $f_{best}$ (Chu and Liu, 1965; Edmonds, 1967).
The semantic vectors of sentences $e$ are then updated using this attention. Here we diverge from the L&L model: in their implementation, each node is updated based on a weighted sum over its parents in the tree (their paper states both parents and children). We instead inform each node by a sum over its children, more in line with past work where information more intuitively percolates from children to parents and not the other way (Ji and Smith, 2017) (we also run experiments without this design change). We calculate the context for all possible children of that sentence as $c_i = \sum_{k} a_{ik} e_k$, where $a_{ik}$ is the probability that $k$ is the child of $i$, and $e_k$ is the semantic vector of the child. The children vectors are then passed through a non-linear function, resulting in the updated semantic vector $e_i$ for parent node $i$.
Finally, a max pooling layer over $e_i$ followed by a linear layer produces the predicted document class $y$. The model is trained with cross-entropy loss.
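To make the marginal computation concrete, the following is a minimal pure-Python sketch of the matrix-tree construction for single-rooted non-projective trees, in the form we understand from Koo et al. (2007), followed by the children-weighted context described above. It is not the released implementation: the function names (`invert`, `tree_marginals`, `children_context`) are hypothetical, the per-sentence root scores `f_root` are an assumed input, and the masking of padded sentences is omitted.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def invert(m: Matrix) -> Matrix:
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(m)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def tree_marginals(f: Matrix, f_root: List[float]) -> Tuple[Matrix, List[float]]:
    """Return (a, a_root): a[i][j] = P(i is parent of j), a_root[j] = P(j is root)."""
    n = len(f)
    A = [[0.0 if i == j else math.exp(f[i][j]) for j in range(n)] for i in range(n)]
    r = [math.exp(x) for x in f_root]
    # Laplacian: column sums of A on the diagonal, -A elsewhere.
    L = [[sum(A[k][j] for k in range(n)) if i == j else -A[i][j]
          for j in range(n)] for i in range(n)]
    L[0] = r[:]  # replace the first row with the root potentials
    L_inv = invert(L)
    a = [[(A[i][j] * L_inv[j][j] if j != 0 else 0.0)
          - (A[i][j] * L_inv[j][i] if i != 0 else 0.0)
          for j in range(n)] for i in range(n)]
    a_root = [r[j] * L_inv[j][0] for j in range(n)]
    return a, a_root


def children_context(a: Matrix, e: Matrix) -> Matrix:
    """Context for each parent i: sum over k of a[i][k] * e[k]."""
    dim = len(e[0])
    return [[sum(a[i][k] * e[k][d] for k in range(len(e))) for d in range(dim)]
            for i in range(len(e))]


# Toy scores for a three-sentence document (arbitrary values).
scores = [[0.0, 1.0, -1.0], [0.5, 0.0, 2.0], [-0.3, 0.2, 0.0]]
root_scores = [1.0, 0.0, -1.0]
arc_marginals, root_marginals = tree_marginals(scores, root_scores)
```

A useful sanity check is that, for every sentence, the probabilities of all possible parents plus the probability of being the root sum to one, since each node has exactly one incoming arc in any tree.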
Additionally, the released L&L implementation has a bug where attention scores and marginals are not masked correctly in the matrix-tree computation, which we correct.
Sentence order discrimination This model is identical to the text classification model, except for task-specific changes. The goal of this synthetic task, proposed by Barzilay and Lapata (2008), is to capture discourse coherence. A negative class is created by generating random permutations of a text's original sentence ordering (the positive class). A coherence score is produced for each positive and negative example, with the intuition that the originally ordered text will be more coherent than the jumbled version. Because we compare two examples at a time (original and permuted order), we modify the model to handle paired inputs and replace cross-entropy loss with a max-margin ranking loss.
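The pairing of an original document with a permuted copy, and the ranking objective over their coherence scores, can be sketched as follows. This is only an illustration under stated assumptions: the margin of 1.0, the function names, and the choice to reject permutations identical to the original are ours and are not specified in the text.

```python
import random
from typing import List, Sequence


def permuted_negative(sentences: Sequence[str], rng: random.Random) -> List[str]:
    """Return a random reordering that differs from the original order."""
    if len(set(sentences)) < 2:
        raise ValueError("need at least two distinct sentences to permute")
    original = list(sentences)
    shuffled = list(sentences)
    while shuffled == original:
        rng.shuffle(shuffled)
    return shuffled


def ranking_loss(score_original: float, score_permuted: float,
                 margin: float = 1.0) -> float:
    """Hinge loss: zero once the original outscores the permutation by `margin`."""
    return max(0.0, margin - score_original + score_permuted)


document = ["a", "b", "c"]
negative = permuted_negative(document, random.Random(0))
```

With this loss, a pair contributes nothing to training once the original ordering is scored at least one margin above its permutation.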

Experiments
We evaluate the model on four text classification tasks and one sentence order discrimination task.

Datasets
Details and statistics are included in Appendix A. Of the document-level datasets used in L&L (SNLI was sentence-level), we omit IMDB and Czech Movies because on IMDB their model did not outperform prior work, and Czech (a language with freer word order than English) highlighted the non-projectivity of their sentence-level trees, which is not the focus of our work.
Yelp (in L&L, 5-way classification) comprises customer reviews from the Yelp Dataset Challenge (collected by Tang et al. (2015)). Each review is labeled with a 1 to 5 rating (least to most positive).
Debates (in L&L, binary classification) are transcribed debates on Congressional bills from the U.S. House of Representatives (compiled by Thomas et al. (2006), preprocessed by Yogatama and Smith (2014)). Each speech is labeled with 1 or 0 indicating whether the speaker voted in favor of or against the bill.
Writing quality (WQ) (not in L&L, binary classification) contains science articles from the New York Times (extracted from Louis and Nenkova (2013)). Each article is labeled as either 'very good' or 'typical' to describe its writing quality. While both classes contain well-written text, Louis and Nenkova (2013) find that features associated with discourse, including sentiment, readability, and PDTB-style discourse relations, are helpful in distinguishing between the two classes.
Writing quality with topic control (WQTC) (not in L&L, binary classification) is similar to WQ, but controlled for topic using a topic similarity list included with the WQ source corpus.
Wall Street Journal Sentence Order (WSJSO) (not in L&L, sentence order discrimination) is the WSJ portion of PTB (Marcus et al., 1993).

Settings
For each experiment, we train the model four times varying only the random seed for weight initializations. The model is trained for a fixed amount of time, and the model from the timestep with highest development performance is chosen. We report accuracies on the test set, and tree analyses on the development set. Our implementation is built on the L&L released implementation, with changes as noted in Section 3. Preprocessing and training details are in Appendix A.

Results
We report accuracy (as in prior work) in Table 1, and perform two ablations: removing the structured attention at the document level, and removing it at both document and sentence levels. Additionally, we run experiments on the original code without the design change or bug fix to confirm our findings are similar (see L&L(orig) in Table 1).
Trees do not learn discourse. Although document-level structured attention provides little benefit in performance, we probe whether the model could still be learning some discourse. We visually inspect the learned $f_{best}$ trees and report statistics on them in Table 2 (see Appendix Table 6 for similar results with the original code, and Table 4 for gold statistics on WQTC). The visual inspection (Figure 2) reveals shallow trees (also reported in L&L); furthermore, the trees have little to no structure. We observe an interesting pattern where the model picks one of the first two or last two sentences as the root, and all other sentences are children of that node. We label these trees as 'vacuous', and the strength of this pattern is reflected in the tree statistics (Table 2). The height of the trees is small, showing the trees are shallow. The proportion of leaf nodes is high; that is, most nodes have no children. Finally, the normalized arc length is high: nodes that are halfway into the document still connect to the root.
We further probe the root sentence, as the model places so much attention on it. We hypothesize the root sentence has strong lexical cues for the task, suggesting the model is instead attending to particular words. In Yelp, reviewers often start or end with a sentiment-laden sentence summarizing their rating. In Debates, speakers begin or end their speech by stating their stance on the bill. In WQ and WQTC, the interpretation of the root is less clear. In WSJSO, we find the root is always the first sentence of the correctly ordered document, which is reasonable and commonly attested in a discourse tree, but the remainder of the vacuous tree is entirely implausible.
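The tree statistics discussed above can be computed from a parent array, as in the sketch below. The exact definitions are assumptions on our part: height is counted as the number of arcs from the root to the deepest node, normalized arc length is the mean distance between a sentence and its parent divided by document length, and a tree is flagged as vacuous when the root is one of the first two or last two sentences and every other sentence attaches directly to it. The function name `tree_stats` is hypothetical.

```python
from typing import Dict, List, Union


def tree_stats(parents: List[int]) -> Dict[str, Union[float, bool]]:
    """parents[i] is the index of sentence i's parent, or -1 for the root."""
    n = len(parents)
    root = parents.index(-1)

    def depth(i: int) -> int:
        steps = 0
        while parents[i] != -1:
            i = parents[i]
            steps += 1
        return steps

    internal = {p for p in parents if p != -1}  # nodes with at least one child
    arcs = [abs(i - p) / n for i, p in enumerate(parents) if p != -1]
    return {
        "height": max(depth(i) for i in range(n)),
        "leaf_proportion": (n - len(internal)) / n,
        "norm_arc_length": sum(arcs) / len(arcs) if arcs else 0.0,
        "vacuous": root in {0, 1, n - 2, n - 1}
        and all(p == root for i, p in enumerate(parents) if i != root),
    }


star = tree_stats([-1, 0, 0, 0, 0])   # every sentence attaches to the first
chain = tree_stats([-1, 0, 1, 2])     # each sentence attaches to the previous
```

The star-shaped example reproduces the vacuous pattern (height 1, mostly leaves, long arcs), whereas the chain is deep, has a single leaf, and has short arcs.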
To confirm our suspicion that the root sentence is lexically marked, we measure the association between words appearing in the root sentence and those elsewhere by calculating their positive pointwise mutual information scores (Table 3).
In Yelp, we find root words often express sentiment and explicitly mention the number of stars given ('sterne' in German, or 'uuu' as coined by a particularly prolific Yelper), which are clear indicators of the rating label. For Debates, words express speaker opinion, politeness and stance which are strong markers for the binary voting label. The list for WQ revolves around tech, suggesting the model is learning topics instead of writing quality. Thus, in WQTC we control for topics.
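One way to compute such an association score is sketched below. It treats each token occurrence as an event paired with whether it appeared in a root sentence, and scores a word by its positive pointwise mutual information with the root position. The exact counting scheme used for Table 3 is not specified in the text, so the event definition and the function name `root_ppmi` are assumptions.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def root_ppmi(root_sentences: Iterable[List[str]],
              other_sentences: Iterable[List[str]]) -> Dict[str, float]:
    """PPMI between each word and the event 'token occurs in a root sentence'."""
    root_counts = Counter(tok for sent in root_sentences for tok in sent)
    other_counts = Counter(tok for sent in other_sentences for tok in sent)
    n_root = sum(root_counts.values())
    total = n_root + sum(other_counts.values())
    p_root = n_root / total
    scores: Dict[str, float] = {}
    for word, count in root_counts.items():
        p_joint = count / total                         # P(word, root)
        p_word = (count + other_counts[word]) / total   # P(word)
        scores[word] = max(0.0, math.log(p_joint / (p_word * p_root)))
    return scores


ppmi = root_ppmi(
    root_sentences=[["great", "food"]],
    other_sentences=[["the", "food", "was", "fine"], ["the", "service"]],
)
```

In this toy example, "great" occurs only in the root sentence and therefore scores higher than "food", which also occurs elsewhere.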

Learning better structure
We next probe whether the structure in L&L can be improved to be more linguistically appropriate, while still performing well on the end task. Given that structured attention helps only on WQTC and learns vacuous trees less frequently there, we focus on this task. We experiment with three modifications. First, we remove the document-level biLSTM, since it performs a level of composition that might prevent the attention from learning the true structure. Second, we note that equation 3 captures possible children only at one level of the tree, but not possible subtrees. We thus perform an additional level of percolation over the marginals, applied after equation 4, to incorporate the children's children in the tree. Third, the max-pooling layer gives the model a way of aggregating over sentences while ignoring the learned structure. Instead, we propose a sum that is weighted by the probability of a given sentence being the root, i.e., using the learned root attention score $a^r_i$: $y = \sum_{i=1}^{n} a^r_i e_i$. We include ablations of these modifications and additionally derive RST discourse dependency trees, collapsing intrasentential nodes, as an approximation to the ground truth.
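The difference between the two pooling choices can be seen in a small sketch. Max pooling takes an element-wise maximum over sentence vectors and never consults the tree, whereas the proposed alternative weights each sentence vector by its probability of being the root. The function names and toy values are illustrative assumptions, not taken from the released code.

```python
from typing import List

Vector = List[float]


def max_pool(vectors: List[Vector]) -> Vector:
    """Element-wise max over sentence vectors; ignores the induced tree."""
    return [max(dim) for dim in zip(*vectors)]


def root_weighted_sum(vectors: List[Vector], root_probs: List[float]) -> Vector:
    """Sum of sentence vectors weighted by each sentence's root marginal."""
    return [sum(p * v for p, v in zip(root_probs, dim)) for dim in zip(*vectors)]


sentence_vectors = [[1.0, 0.0], [0.0, 2.0]]
root_probs = [0.75, 0.25]  # toy root marginals, summing to one
pooled_max = max_pool(sentence_vectors)
pooled_root = root_weighted_sum(sentence_vectors, root_probs)
```

Here the max-pooled vector is the same regardless of which sentence is the likely root, while the weighted sum shifts toward the first sentence because it carries most of the root probability.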
The results (Table 4) show that simply removing the biLSTM produces trees with more structure (deeper trees, fewer leaf nodes, shorter arc lengths, and fewer vacuous trees), confirming our intuition that it was doing the work for the structured attention. However, it also results in lower performance. Changing the pooling layer from max to weighted sum both hurts performance and results in shallower trees (though still deeper than Full), which we attribute to this layer still being a pooling function. Introducing an extra level of tree percolation yields better trees but also a drop in performance. Finally, using 4 levels of percolation both reaches the accuracy of Full and retains the more structured trees. We hypothesize accuracy does not surpass Full because this change also introduces extra parameters for the model to learn.
While our results are a step in the right direction, the structures are decidedly not discourse when compared to the parsed RST dependency trees, which are far deeper with far fewer leaf nodes, shorter arcs and no vacuous trees. Importantly, the tree statistics show the structures do not follow the typical right-branching structure in news: the trees are shallow, nodes often connect to the root instead of a more immediate parent, and the vast majority of nodes have no children.
In work concurrent to ours, Liu et al. (2019) propose a new iterative algorithm for the structured attention (in the same spirit as our extra percolations) and apply it to a transformer-based summarization model. However, even these induced trees are not comparable to RST discourse trees. The induced trees are multi-rooted by design (each root is a summary sentence), which is unusual for RST; their reported tree height and edge agreement with RST trees are low.

Conclusion
In this paper, we evaluate structured attention in document representations as a proxy for discourse structure. We first find structured attention at the document level is largely unhelpful, and second it instead captures lexical cues resulting in vacuous trees with little structure. We propose several principled changes to induce better structures with comparable performance. Nevertheless, calculating statistics on these trees and comparing them to parsed RST trees shows they still contain no meaningful discourse structure. We theorize some amount of supervision, such as using ground-truth discourse trees, is needed for guiding and constraining the tree induction.

A Appendices
Datasets Statistics for the datasets are listed in Table 5. For WQ, the very good class was created by Louis and Nenkova (2013) using as a seed the 63 articles in the New York Times corpus (Sandhaus, 2008) deemed to be high-quality writing by a team of expert journalists. The class was then expanded by adding all other science articles in the NYT corpus that were written by the seed authors (4,253 articles). For the typical class, science articles by all other authors were included (19,520). Because the data is very imbalanced, we undersample the typical class to be the same size as the very good. We split this data into 80/10/10 for training, development and test, with both classes equally represented in each partition.
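The undersampling and class-balanced 80/10/10 partitioning described above could be implemented along the following lines. This is a sketch under assumptions: the random seed, the function name `balanced_split`, and the per-class rounding are ours, and the actual split used in the experiments may have been produced differently.

```python
import random
from typing import Dict, List, Sequence, Tuple


def balanced_split(very_good: Sequence[str], typical: Sequence[str],
                   seed: int = 0) -> Dict[str, List[Tuple[str, int]]]:
    """Undersample `typical` to the size of `very_good`, then split 80/10/10 per class."""
    rng = random.Random(seed)
    typical_sample = rng.sample(list(typical), len(very_good))
    splits: Dict[str, List[Tuple[str, int]]] = {"train": [], "dev": [], "test": []}
    for label, docs in ((1, list(very_good)), (0, typical_sample)):
        rng.shuffle(docs)
        n_train = int(0.8 * len(docs))
        n_dev = int(0.1 * len(docs))
        # Splitting each class separately keeps the classes balanced per partition.
        splits["train"] += [(d, label) for d in docs[:n_train]]
        splits["dev"] += [(d, label) for d in docs[n_train:n_train + n_dev]]
        splits["test"] += [(d, label) for d in docs[n_train + n_dev:]]
    return splits


demo_splits = balanced_split(
    very_good=[f"vg{i}" for i in range(10)],
    typical=[f"t{i}" for i in range(40)],
)
```

Because the split is done within each class, every partition contains the same number of 'very good' and 'typical' articles.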
For WQTC, the original dataset authors provide a list of the 10 most topically similar articles for each article. We make use of this list to explicitly sample topically similar documents.
Preprocessing For Debates and Yelp, we follow the same preprocessing steps as in L&L, but do not set a minimum frequency threshold when creating the word embeddings. For our three datasets, sentences are split and tokenized using Stanford CoreNLP.
Training For all models, we use the Adagrad optimizer with a learning rate of 0.05. For WQ, WQTC, and WSJSO, gradient clipping is performed using the global norm with a ratio of 1.0. The batch size is 32 for all models except WSJSO, which uses 16. All models are trained for a maximum of 8 hours on a GeForce GTX 1080 Ti card.
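For reference, the training settings above can be collected into a small configuration, together with a sketch of clipping by global norm. We read the "ratio of 1.0" as the maximum allowed global norm, which is an assumption; the dictionary keys and the function name are likewise hypothetical and do not come from the released code.

```python
import math
from typing import List

# Hyperparameters as stated in the text; key names are illustrative.
TRAINING = {
    "optimizer": "adagrad",
    "learning_rate": 0.05,
    "batch_size": {"default": 32, "WSJSO": 16},
    "clip_global_norm": {"WQ": 1.0, "WQTC": 1.0, "WSJSO": 1.0},
    "max_hours": 8,
}


def clip_by_global_norm(grads: List[List[float]],
                        max_norm: float = 1.0) -> List[List[float]]:
    """Rescale all gradients jointly so their combined L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= max_norm:
        return [grad[:] for grad in grads]
    scale = max_norm / global_norm
    return [[g * scale for g in grad] for grad in grads]


clipped = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
```

Clipping by the global norm rescales every gradient by the same factor, so the direction of the overall update is preserved.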
Results Because our results hinge on multiple runs of experiments, each initialized with different random weights, we include here more detailed versions of our results to more accurately illustrate their variability. Table 6 supplements Table 2 with tree statistics from L&L(orig), the model without the design change or bug fix, to illustrate that the trees derived from this model are similar. Finally, Table 7 is a more detailed version of Table 4, which additionally includes maximum accuracy, standard deviation for accuracy, as well as the average parent entropy calculated over the latent trees.
Table 7: Max | mean (standard deviation) test accuracy and tree statistics of the WQTC dev set (averaged across four training runs with different initialization weights). Bolded numbers are within 1 standard deviation of the best performing model. +w uses the weighted sum, +p adds 1 extra level of percolation, +4p adds 4 levels of percolation. The last row shows the ('gold') parsed RST discourse dependency trees.