A Centering Approach for Discourse Structure-aware Coherence Modeling

Previous neural coherence models have focused on identifying semantic relations between adjacent sentences. However, they do not have the means to exploit structural information. In this work, we propose a coherence model which takes discourse structural information into account without relying on human annotations. We approximate a linguistic theory of coherence, Centering theory, which we use to track the changes of focus between discourse segments. Our model first identifies the focus of each sentence, recognized with regard to the context, and constructs the structural relationships between discourse segments by tracking the changes of the focus. The model then incorporates this structural information into a structure-aware transformer. We evaluate our model on two tasks, automated essay scoring and assessing writing quality. Our results demonstrate that our model, built on top of a pretrained language model, achieves state-of-the-art performance on both tasks. We next statistically examine the identified trees of texts assigned different quality scores. Finally, we investigate what our model learns in terms of theoretical claims.


Introduction
Coherence describes the semantic relation between elements of a text. It identifies a text passage as either a unified whole or a collection of unrelated sentences. The most well-known formal theory, Centering theory, determines the most salient item in each sentence, the center or focus, and tracks the changes of the focus (Grosz et al., 1995). Prior studies of coherence have mainly focused on modeling local coherence following Centering theory (Barzilay and Lapata, 2008). They aim to identify the semantic relations between adjacent sentences. However, coherence arises not only at the local level, but also at the document level, giving insight into the structure of the discourse.
Discourse structure represents the semantic organization of a text. Incorporating structural information into the model has been beneficial for diverse downstream tasks including text summarization (Marcu, 2000), translation (Guzmán et al., 2014), sentiment analysis (Bhatia et al., 2015), and text classification (Ji and Smith, 2017).
To identify discourse structure, earlier work adopts a supervised approach, relying on human annotations (Hernault et al., 2010; Wang et al., 2017). However, annotating discourse structure is time-consuming and costly. It requires annotators to understand not only the local context surrounding the target sentence but also higher-level relations. Learning latent structure has been proposed to alleviate this limitation. This approach induces the discourse structure of a text without annotations using an attention layer (Liu and Lapata, 2018). Recent work argues, however, that the learned trees have little to no structure at the document level, and that the model relies on specific linguistic cues (Ferracane et al., 2019).
In this paper, we propose a coherence model inspired by Centering theory which takes structural information into consideration. Our model does not rely on human annotations to identify this information. It consists of two components: (1) a discourse segment parser which constructs structural relationships between discourse segments by tracking the changes of the focus, and (2) a structure-aware transformer which exploits this structural information to update sentence representations.
The discourse segment parser first identifies the hierarchical discourse segments of a text, building upon an approximation of Centering theory (Grosz et al., 1995). This theory defines three data structures to describe the focus of a sentence: a list of forward-looking centers (Cf), the preferred center (Cp), and a single backward-looking center (Cb). Cf contains the salient items of the sentence, which are candidates for the focus of the next sentence, and Cp is the most preferred item in Cf. Cb describes the focus of a sentence with regard to the previous context. The theory also defines centering transitions to describe the changes of focus by comparing the two centers Cp and Cb. We propose an algorithm to approximate this theory using a pretrained language model. Our algorithm first identifies the focus of sentences using multi-head attention scores provided by the pretrained language model and semantic similarity between vector representations. It then constructs hierarchical discourse segments using a focus stack, inspired by the concept of Grosz and Sidner (1986), to track the changes of the focus between discourse segments.
Secondly, we propose a structure-aware transformer to account for structural information. Vaswani et al. (2017) introduce the transformer, a model solely based on a self-attention mechanism. This mechanism relates all items in a sequence to capture their semantic relations. In contrast, the self-attention of our transformer is restricted to considering sentences with regard to the identified hierarchical discourse segments. We first calculate document structure priors to let self-attention relate sentences connected in the identified structure. The document structure attention is then calculated by element-wise multiplication of the document structure priors and the self-attention of a naive transformer. We evaluate our model on two tasks: automated essay scoring (AES) and assessing writing quality (AWQ). AES is the task of assigning a score to a given essay, aiming to replicate human scoring results (Dong and Zhang, 2016); this task has been used to evaluate coherence models (Burstein et al., 2010). AWQ is the task of assigning labels of text quality as judged by human annotators; coherence is one of the most essential aspects of text quality (Feng et al., 2014). We first show that a simple fine-tuned model, relying on a pretrained language model, outperforms the state of the art on both tasks. We then demonstrate that our model achieves state-of-the-art performance on both tasks. Our results indicate that the identified trees let the model assess text quality better through structure-aware coherence modeling. We then examine the identified trees to investigate differences between texts of different writing quality. We finally inspect identified centers to investigate what our model learns in terms of theoretical claims.

Related Work
While unsupervised approaches to discourse parsing have been developed (Marcu and Echihabi, 2002; Ji et al., 2015), earlier work mostly adopted a supervised approach to identify discourse structure, relying on human annotations. Subba and Di Eugenio (2009) incorporate various linguistic features, including compositional semantics and part-of-speech information, to propose a discourse parser based on Inductive Logic Programming. Hernault et al. (2010) introduce a discourse parser which constructs discourse structure from a full input text. They train classifiers to identify discourse relations, and use them to build a tree structure of an input text. Feng and Hirst (2012) improve the tree-building algorithm of this system by incorporating more linguistic features. Wang et al. (2017) introduce an SVM-based model that consists of two stages, one identifying discourse structure, and the other classifying types of relations between units. More recently, neural models have been developed to recognize discourse structure (Li et al. […]). Much of this work builds on Rhetorical Structure Theory (Mann and Thompson, 1988). This theory represents a document as a tree structure built by connecting discourse units recursively through predefined discourse relations. Another line of work is based on the Penn Discourse Treebank (Webber et al., 2019), which annotates discourse structure in a lexically-grounded approach. These studies represent discourse structure with discourse relations. Unlike them, our model does not consider discourse relations; instead, we draw on Centering theory to take the structural relationships between discourse segments into account.
A supervised approach requires annotations for each task. To overcome the lack of labeled datasets, recent work has investigated learning latent structures, inducing the tree structure directly from a text. While Yogatama et al. (2017) and Choi et al. (2018) induce structure at the sentence level to learn syntax, Liu and Lapata (2018) propose a neural model which induces structural information without a labeled resource. They induce a non-projective dependency structure from a text by structured attention. More recently, however, Ferracane et al. (2019) claim that induced document-level structures neither match human intuitions nor align with linguistic theories. Unlike latent structure learning, we identify hierarchical discourse segments using a pretrained language model. This lets our model identify the focus of a sentence by comparing semantic similarities between representations of sentences, without relying on a resource of manually labeled discourse structure.

Model

Figure 1 presents the architecture of our coherence model. We first introduce input representations at the sentence level using a pretrained language model. We then describe the algorithm of the discourse segment parser. Finally, we present the structure-aware transformer and the document representation it produces.

Sentence Representations
We use a pretrained language model to obtain representations of sentences. In this work, we employ XLNet as the pretrained language model (Yang et al., 2019). XLNet not only outperforms BERT (Devlin et al., 2019), but is also well suited to modeling coherence because of its training objective: it maximizes the expected likelihood over all permutations of the factorization order during training.
We first encode an input document using XLNet to produce word representations. We obtain sentence representations by averaging all word representations in a sentence. We then feed the sentence representations to the discourse segment parser and the structure-aware transformer.
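As a concrete illustration, the averaging step can be sketched as follows. This is a minimal numpy sketch, not the actual implementation: the real model uses XLNet outputs in PyTorch, and the function and variable names here are our own.

```python
import numpy as np

def sentence_representations(word_reps, sent_bounds):
    """Average word vectors within each sentence span.

    word_reps: (num_words, hidden) array of encoder outputs
               (illustrative stand-in for XLNet outputs).
    sent_bounds: list of (start, end) word indices per sentence.
    """
    return np.stack([word_reps[s:e].mean(axis=0) for s, e in sent_bounds])

# Toy example: 6 "words" of dimension 2, split into two sentences.
words = np.arange(12, dtype=float).reshape(6, 2)
sents = sentence_representations(words, [(0, 3), (3, 6)])
```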

Discourse Segment Parser
Our discourse segment parser is inspired by Centering theory (Grosz et al., 1995). We modify Centering theory to approximate it in a neural model. The theory considers entities as candidates for centers.
To determine centers at the phrase level or the entity level, we would need to incorporate an external parser into the model to identify phrases or entities. The performance of the model would then crucially depend on how accurately the external parser identifies them. Hence, we determine centers at the word level so that our model is not affected by the performance of an external parser. Figure 2 gives an overview of our approach to identifying the focus of sentences. To represent the focus of a sentence, we model the backward-looking center and forward-looking centers using scores computed by multi-head self-attention in XLNet. Recent work shows that the multi-head attention of a pretrained language model represents important linguistic notions of the input sequence (Clark et al., 2019; Vig and Belinkov, 2019; Sen et al., 2020). It also suggests that self-attention might be biased toward specialized tokens used in training, <SEP>, <CLS>, and punctuation tokens; hence we only consider actual items by filtering these tokens out. Following previous work, we use the averaged scores of the multi-head self-attention extracted from the last layer of the model. To identify the salient items of sentences, we encode each sentence separately.
To determine the forward-looking centers of a sentence at the word level, we extract the diagonal elements of the matrix representing the multi-head self-attention of the encoded sentence. We then select the words with the top-k scores among the extracted elements, represented by their XLNet vectors, as the forward-looking centers. The preferred center of a sentence is the top-1 item among the forward-looking centers. The backward-looking center of a sentence is the item most related to one of the forward-looking centers of the immediately preceding sentence (Brennan et al., 1987). We compare semantic similarity between the averaged word representations of the current sentence and each forward-looking center of the immediately preceding sentence, using cosine similarity as the measure.
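The center selection just described can be sketched as follows. This is illustrative numpy code under our reading of the method; the attention matrix and vectors would come from XLNet, and all names are hypothetical.

```python
import numpy as np

def forward_centers(attn, k=3):
    """Select Cf: the top-k word indices by the diagonal of the
    (head-averaged) self-attention matrix; Cp is the top-1 index.
    attn: (num_words, num_words) averaged attention scores."""
    diag = np.diag(attn)
    cf = np.argsort(-diag)[:k]
    return cf, cf[0]  # Cf indices, Cp index

def backward_center(sent_rep, prev_cf_vecs):
    """Select Cb: the previous sentence's Cf vector most
    cosine-similar to the current sentence representation."""
    sims = [v @ sent_rep / (np.linalg.norm(v) * np.linalg.norm(sent_rep))
            for v in prev_cf_vecs]
    return int(np.argmax(sims)), max(sims)
```

In the full model, Cb would only be accepted when the similarity exceeds the threshold (0.945 in our setup).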
Previous work introduces concepts to describe the changes of focus. Grosz et al. (1995) describe three types of centering transitions: Continue, Retain, and Shifting, as shown in Table 1. Continue maintains the current focus, and Retain intends to change the focus to an item recognized in the current sentence. Shifting indicates that the focus is different from the previous sentence. Grosz and Sidner (1986) introduce a focus stack which stores discourse segments related to the current focus.
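The centering transitions of Table 1 can be expressed as a small decision rule. This is a simplified sketch of the standard definitions; the paper's exact comparison of Cp and Cb may differ.

```python
def transition(cb_cur, cb_prev, cp_cur):
    """Classify a centering transition (after Grosz et al., 1995).

    cb_cur:  backward-looking center of the current sentence
    cb_prev: backward-looking center of the previous sentence
    cp_cur:  preferred center of the current sentence
    """
    if cb_cur == cb_prev:
        # Focus is maintained; Continue if it is also preferred.
        return "Continue" if cb_cur == cp_cur else "Retain"
    return "Shifting"
```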
In this work, we propose an algorithm to construct the hierarchical discourse segments of a text using these concepts (Algorithm 1). For each sentence, we iterate the process until the focus stack is empty or we find a change of the focus. For Continue, we add the current sentence to the current segment without changing the stack (lines 9-10). For Retain, we push the current segment onto the stack, which connects the discourse segment of the top item in the stack to the current segment (lines 11-13). For Shifting, we pop the discourse segment from the stack, and iterate the process for the next sentence (lines 16-17). If the process completes because of an empty stack, we push s_i as a new segment to process the next sentence (lines 20-23). During the process, we build an adjacency matrix to represent the changes of the focus stack. Finally, we connect the adjacent sentences in each discourse segment.
Algorithm 1 The discourse segment parser.
1: procedure PARSER(S, Cb, Cp, t_sim)
2:   Seg ← {}  ▷ A list for the current segment
     ⋮  [intermediate lines lost in extraction]
     return Adj_Mat
28: end procedure
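Since Algorithm 1 is only partially legible here, the following is a hedged Python sketch of the focus-stack procedure as described in the prose. It is a simplification: the segment bookkeeping and the handling of repeated pops are our own assumptions.

```python
def parse_segments(transitions):
    """Sketch of the focus-stack parser (simplified reading of Algorithm 1).

    transitions: one label per sentence after the first
                 ("Continue", "Retain", or "Shifting").
    Returns an adjacency matrix over sentence indices.
    """
    n = len(transitions) + 1
    adj = [[0] * n for _ in range(n)]
    stack = [[0]]   # focus stack of open segments (lists of sentence ids)
    closed = []     # segments popped off the stack
    for i, t in enumerate(transitions, start=1):
        if t == "Continue":        # same focus: stay in the current segment
            stack[-1].append(i)
        elif t == "Retain":        # focus moves down: attach a sub-segment
            adj[stack[-1][-1]][i] = adj[i][stack[-1][-1]] = 1
            stack.append([i])
        else:                      # Shifting: pop the current segment
            closed.append(stack.pop())
            if not stack:
                stack.append([i])  # empty stack: start a new segment
            else:
                stack[-1].append(i)
    # connect adjacent sentences inside every segment
    for seg in closed + stack:
        for a, b in zip(seg, seg[1:]):
            adj[a][b] = adj[b][a] = 1
    return adj
```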

Structure-aware Transformer
To take structural information into account, we propose a structure-aware transformer.
Our structure-aware transformer is inspired by the Tree-Transformer (Wang et al., 2019), which updates its hidden representations by inducing a tree structure from a document. The Tree-Transformer generates constituent priors by calculating neighboring attention, which represents the probability that adjacent items are in the same constituent. The constituent priors constrain the self-attention of the transformer to follow the induced structure. Instead of inducing a tree structure, our model uses the identified structural information to generate document structure priors, which guide the self-attention of the transformer. Sentences which are not connected in the structure are constrained not to attend to each other. The document structure priors are then used to calculate structure-aware attention.
We calculate structure-aware attention scores using the identified hierarchical discourse segments. We compute the score s_{i,j} relating s_i and s_j by the scaled dot-product attention,

s_{i,j} = (q_i^ds · k_j^ds) / √d,

to represent the semantic relation between s_i and s_j, where q^ds is a query matrix and k^ds is a key matrix of the document structure attention. We represent the hierarchical discourse segments by an adjacency matrix. To let the model learn attention with the structural information, we mask scores by the adjacency matrix:

Ŝ = mask(S, adj),

where adj is the adjacency matrix representing document structure. We apply a softmax function to each row of the score matrix to represent the probability that s_i attends to other connected sentences:

p_{i,j} = softmax(ŝ_{i,·})_j.

To make a symmetric matrix, we calculate the structure-aware attention score

a_{i,j} = √(p_{i,j} × p_{j,i}).

We follow Wang et al. (2019) in covering more relations at higher levels by applying a hierarchical constraint, which restricts a^l_k to be larger than a^{l-1}_k for layer l and sentence index k:

â^l_k = a^{l-1}_k + (1 − a^{l-1}_k) a^l_k.

We then calculate the document structure priors D_{i,j} using a log-sum instead of multiplication to calculate them efficiently:

log D_{i,j} = Σ_{k=i}^{j−1} log â_k.

Finally, the attention score E of the structure-aware transformer is calculated by element-wise multiplication of the document structure priors and the self-attention of a naive transformer:

E = D ⊙ softmax(QK^T / √d_k),

where Q are query vectors and K are key vectors with dimension d_k in the naive transformer.
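A single layer of the masked, symmetrized attention can be sketched as follows. This is numpy for illustration only, omitting the hierarchical constraint and the prior computation; the weight matrices and names are placeholders, not the model's actual parameters.

```python
import numpy as np

def structure_attention(h, adj, w_q, w_k):
    """Masked, symmetrized attention over sentences.

    h:   (n, d) sentence representations.
    adj: (n, n) 0/1 adjacency matrix of the discourse segments.
    """
    d = w_q.shape[1]
    q, k = h @ w_q, h @ w_k
    s = q @ k.T / np.sqrt(d)                 # scaled dot-product scores
    s = np.where(adj > 0, s, -1e9)           # mask unconnected sentences
    p = np.exp(s - s.max(axis=1, keepdims=True))
    p = p / p.sum(axis=1, keepdims=True)     # row-wise softmax
    a = np.sqrt(p * p.T)                     # symmetric attention score
    return a
```

Note that `a` is symmetric by construction, and masked pairs receive (near-)zero attention, which is the property the document structure priors rely on.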

Document Representation
In the last layer of our model, we apply document attention to produce the weighted sum of all the updated sentence representations. The document attention identifies relative weights of updated sentence representations which enables our model to handle any document length. Finally, a feedforward network is applied to the representation to produce the output value.
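The document attention can be sketched with a common additive-attention parameterization. This is an assumption, since the paper does not give the exact form; `w` and `v` are hypothetical learned parameters.

```python
import numpy as np

def document_representation(sent_reps, w, v):
    """Weighted sum of updated sentence representations.

    sent_reps: (n, d) sentence representations (any n, so any length).
    w: (d, d) and v: (d,) are placeholder attention parameters.
    """
    scores = np.tanh(sent_reps @ w) @ v          # one scalar per sentence
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()            # softmax over sentences
    return weights @ sent_reps                   # weighted sum, shape (d,)
```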

Implementation Details
We implement our model using the PyTorch library and use the Stanford Stanza library for sentence tokenization. We employ XLNet as the pretrained language model. For the baselines that do not use a pretrained language model, we use pretrained GloVe word embeddings (Pennington et al., 2014). We set the top-k for selecting Cf to 3 and the semantic-similarity threshold for comparing vector representations to 0.945 (see Appendix B for more training details and parameters). Due to memory constraints, we encode each sentence separately using XLNet instead of the whole document at once. Our datasets contain long documents, i.e., articles with more than 3,000 tokens, and it is practically infeasible to encode all words of such a document at once with the pretrained model. We use 46GB of GPU memory on two NVIDIA P40s for each run.
We re-implemented all baselines to compare them in the same deep-learning framework, PyTorch. We then used our re-implementations to report the performance of all models over 10 runs with different random seeds. We verified statistical significance (p-value < 0.01) with both a one-sample t-test, which verifies the reproducibility of the performance of each model, and a two-sample t-test, which verifies that the performance of our model is statistically significantly better than that of the other models. To fulfill the request for fairer comparisons between neural models (Dodge et al., 2019), we also report validation performance and the standard deviation of the performance (see Appendix D for more details).
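The two-sample comparison over runs can be illustrated with a Welch t statistic. This is an illustrative sketch; the paper does not specify the exact t-test variant it uses.

```python
from statistics import mean, stdev
from math import sqrt

def welch_t(a, b):
    """Two-sample (Welch) t statistic over two lists of run scores,
    e.g., 10 accuracies of our model vs. 10 of a baseline."""
    va = stdev(a) ** 2 / len(a)
    vb = stdev(b) ** 2 / len(b)
    return (mean(a) - mean(b)) / sqrt(va + vb)
```

In practice one would compare the statistic against the t distribution (e.g., via `scipy.stats.ttest_ind`) to obtain the p-value.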

Baselines
We first compare against the latent structure learning model for discourse parsing by Liu and Lapata (2018). While their model induces structure at both the sentence level and the document level, we only induce structure at the document level due to memory constraints for large documents. We then compare against a neural coherence model: Mesgar and Strube (2018) propose a local coherence model inspired by Centering theory. This model finds the two most similar RNN outputs to determine the most salient parts of sentences and connect adjacent sentences. It has been evaluated on the AES task as well as on the task of assessing readability.
To investigate the influence of a pretrained language model on this task, we implement two baseline models. We first develop a simple fine-tuned model relying on the pretrained language model (Averaged-XLNet). This simple model encodes an input document at the sentence level and averages the encoded representations. We also implement a second model which combines a state-of-the-art latent tree learning model with the pretrained language model (XLNet+Wang et al. (2019)). This model encodes an input document at the sentence level and updates representations using the Tree-Transformer (Wang et al., 2019). Instead of averaging, document attention is applied to produce a weighted-sum vector representation.
For AES, we also compare against the state of the art for this task. Dong et al. (2017) introduce a model which consists of a convolutional layer followed by a recurrent layer and an attention layer (Bahdanau et al., 2015).

Automated Essay Scoring
Datasets. To examine the effectiveness of our model on AES, we evaluate it on the Test of English as a Foreign Language (TOEFL) dataset. TOEFL has overall higher-quality essays compared to the frequently used dataset for AES, the Automated Student Assessment Prize (ASAP) dataset. The essays in ASAP are written by students in grade levels 7 to 10 of US middle schools, and many consist of only a few sentences. In contrast, the essays in TOEFL are submitted by non-native students for the standardized English test used for university entrance. The prompts in TOEFL do not vary much, the student population is more controlled, and the essays have similar lengths (see Appendix A for more details).
Evaluation Setup. We follow the evaluation setup of previous work on AES (Taghipour and Ng, 2016). For TOEFL, we evaluate performance with accuracy for the three-class classification problem using 5-fold cross-validation. We use the cross-entropy loss for training and the ADAM optimizer with a learning rate of 0.003. We evaluate performance for 20 epochs on the validation set. The model which reaches the best accuracy on the validation set is then applied to the test set. We use a mini-batch size of 32 with random shuffling.

To better understand how the model works, we conduct an error analysis. This analysis shows that the uneven label distribution causes biased predictions in the model of Liu and Lapata (2018). The TOEFL dataset has an uneven label distribution: 11.0%/54.3%/34.7% for low, mid, and high scores, respectively. In contrast, all models built upon pretrained language models generally predict the different scores in an unbiased fashion. XLNet+Wang et al. (2019) shows, however, more bias toward the middle score than Averaged-XLNet. This indicates that, like the model of Liu and Lapata (2018), this baseline exploits the uneven distribution, which leads to better performance. Our model mostly predicts the low and the high scores better. This suggests that our model does not take advantage of the uneven distribution but assesses essay quality by modeling coherence.

Assessing Writing Quality
Datasets. Louis and Nenkova (2013) use a dataset of scientific articles from the New York Times (NYT) for assessing writing quality. They assign each article to one of two classes, typical or good, by a semi-supervised approach. Though articles included in both classes are generally of good quality, Louis and Nenkova (2013) […]

Table 4: Statistics for the trees learned by our model, per label, described as mean (standard deviation).
They report the performance of this model with an embedding layer trained on the NYT corpus itself. To ensure a fair comparison of the model across different datasets, we use a pretrained GloVe embedding layer. Table 3 reports the performance of all models on the NYT test set. The model of Liu and Lapata (2018) with the pretrained GloVe embedding layer shows significantly lower performance than the same model with the embedding layer trained on NYT. Averaged-XLNet performs better, which shows that employing a pretrained language model is beneficial, and XLNet+Wang et al. (2019) outperforms this model. Our model achieves state-of-the-art performance on NYT among the models using the pretrained embedding layer, but it still shows lower performance than the model using the embedding layer trained on the target corpus. This suggests that corpus-specific linguistic cues have the potential to improve our model further.

Learned Discourse Structure
We next statistically examine the discourse structure identified by our parser. Ferracane et al. (2019) evaluate the induced structure learned by the model of Liu and Lapata (2018) using four measures: the average height of trees, the proportion of leaf nodes, the normalized arc length, and the ratio of vacuous trees. They define a vacuous tree as a shallow tree whose nodes are connected to the root directly.
We report statistics on the trees identified by our parser in Table 4. We modify two measures: the normalized tree height and the ratio of small trees. We normalize the tree height by the number of nodes to take the length of documents into account. Since there are no vacuous trees among our trees, we report the ratio of small trees, defined as trees whose normalized tree height is smaller than 0.2 and whose height is smaller than 3. In addition, we report the proportion of nodes at the top level.

Figure 5: Example of the identified hierarchical discourse segments, where DS is a discourse segment and s is a sentence: an essay with a high score whose essay-id is 913590 in TOEFL (see Appendix E for more details).

Ferracane et al. (2019) show that the trees learned by the model of Liu and Lapata (2018) are mostly vacuous or shallow trees, whose proportion of leaf nodes is greater than 0.9. In contrast, the measures confirm that our model finds differences in the structure of texts of different score levels. The trees are not shallow; there are even no small trees, and the proportion of leaf nodes is less than 15%. The normalized arc length is high in our trees, which indicates that content connected to the root also occurs in the late part of a document. We suspect that this is the result of modeling the changes of focus instead of being biased toward the focus captured at the beginning of a document. Figure 5 visualizes an example essay in TOEFL. If texts are scored lower, trees are higher with more leaf nodes, and the proportion of nodes at the top level is lower. In NYT, we observe that the trees are more similar according to the four measures. However, we still observe that texts of lower quality in NYT have higher trees and more leaf nodes. These trees are more skewed. This suggests that the focus is more biased toward specific content in texts of lower quality. In our manual examination, we also observe a few cases in which texts of lower quality show very shallow trees. This suggests that the focus changes less frequently than in texts of higher quality.

Table 5: Top-10 most preferred centers (proportions) of essays submitted to the same prompt in TOEFL, a NYT article whose id is 1458761, and a NYT article whose id is 1516415 (see Appendix F for more details).

Table 6: Proportion of function words determined as centers in essays submitted to prompts 1 and 5 in TOEFL (T), a NYT article whose id is 1458761 (N-14*), and a NYT article whose id is 1516415 (N-15*).
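The tree measures used in this analysis (height normalized by the number of nodes, proportion of leaf nodes, and the small-tree criterion) can be computed from a parent-pointer tree roughly as follows. This reflects our reading of the measures; counting height in nodes rather than edges is an assumption.

```python
def tree_stats(parent):
    """Tree statistics from a parent array: parent[i] is the parent
    of node i, and parent[root] == -1."""
    n = len(parent)
    children = {i: [] for i in range(n)}
    root = 0
    for i, p in enumerate(parent):
        if p < 0:
            root = i
        else:
            children[p].append(i)

    def height(v):  # height counted in nodes
        return 1 + max((height(c) for c in children[v]), default=0)

    h = height(root)
    leaf_prop = sum(1 for i in range(n) if not children[i]) / n
    norm_h = h / n                      # height normalized by #nodes
    small = norm_h < 0.2 and h < 3      # "small tree" criterion above
    return {"height": h, "norm_height": norm_h,
            "leaf_prop": leaf_prop, "small": small}
```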

Centering Analysis
We finally inspect the identified centers to investigate what our model learns with regard to the most preferred centers in Centering theory. We explore two questions, (1) whether the identified centers are related to the given topic of a text and (2) whether the centers rely on function words. While all essays submitted to a prompt in TOEFL have the same topic, articles in NYT have different topics. Hence, we inspect centers at the prompt level in TOEFL and for each document in NYT.
We first examine the proportions of the most preferred centers. Table 5 shows that our discourse segment parser indeed identifies centers related to the topic of prompts in TOEFL and to the title of each document in NYT. For instance, the given topic of prompt 1 in TOEFL is "Is it better to have a broad knowledge of many academic subjects than to specialize in one specific subject?", and we observe that the preferred centers are related to this topic. However, we also observe a few types of undesirable cases when interpreting centers. The most common case is that the identified centers are related to the topic but are redundant with other centers: they have the same meaning but a different form, such as a different tense or grammatical number. Another undesirable case is when centers are subword-level tokens produced by the subword tokenization of the pretrained language model. This not only makes it difficult to interpret centers intuitively, but the model might also capture a focus different from the author's intention.
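The proportions reported in Table 5 amount to simple frequency counts over the per-sentence preferred centers. A minimal sketch, with made-up tokens rather than real data:

```python
from collections import Counter

def preferred_center_proportions(cps, top=10):
    """Proportion of each most preferred center (Cp) across sentences.

    cps: list of Cp tokens, one per sentence.
    Returns the `top` most frequent centers with their proportions.
    """
    counts = Counter(cps)
    total = sum(counts.values())
    return [(w, c / total) for w, c in counts.most_common(top)]
```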
We then verify whether our model determines function words as centers. Table 6 shows the proportion of function words determined as centers and the average proportion among all centers. The proportion of function words is lower than or comparable to that of other centers. Hence, this analysis indicates that our model does not exploit function words to capture focus.

Conclusions
We propose a neural model of coherence inspired by Centering theory. The intuition is that coherence can be described by tracking the changes of the focus between discourse segments. Our model identifies the hierarchy of discourse segments without human annotations, and incorporates this structural information into the model. We demonstrate that the identified hierarchical discourse segments improve the performance of the model on two tasks, automated essay scoring and assessing writing quality. Interestingly, we find statistical differences between trees generated from texts of different quality.

B Training and Parameters
For TOEFL, we use a mini-batch size of 32 with random shuffling. For NYT, we use a mini-batch size of 128 with random shuffling. For both datasets, we train models with the ADAM optimizer, a learning rate of 0.003, and an epsilon of 1e-4. We evaluate performance for 20 epochs. For the baseline models which do not use a pretrained language model, we use pretrained GloVe embeddings, 100-dimensional for TOEFL and 50-dimensional for NYT. We clip gradients at 1.0 except for the latent structure learning model for discourse parsing. To update the sentence representations obtained by the pretrained language model, we use the same dimension as the pretrained language model in the structure-aware transformer. We tune hyperparameters manually. We use 46GB of GPU memory on two NVIDIA P40s for each run. Training our model takes approximately 0.3 days on TOEFL and 11 days on NYT. The other two baselines relying on the pretrained language model take less time to train.

Prompt 1: Agree or Disagree: It is better to have broad knowledge of many academic subjects than to specialize in one specific subject.
Prompt 2: Agree or Disagree: Young people enjoy life more than older people do.
Prompt 3: Agree or Disagree: Young people nowadays do not give enough time to helping their communities.
Prompt 4: Agree or Disagree: Most advertisements make products seem much better than they really are.
Prompt 5: Agree or Disagree: In twenty years, there will be fewer cars in use than there are today.
Prompt 6: Agree or Disagree: The best way to travel is in a group led by a tour guide.
Prompt 7: Agree or Disagree: It is more important for students to understand ideas and concepts than it is for them to learn facts.
Prompt 8: Agree or Disagree: Successful people try new things and take risks rather than only doing what they already know how to do well.
C Scores on Multi-head Attention

• s1: Peter wants to play the piano.
• s2: He went to the piano store to buy one.
• s3: It was closed.

The attention scores for these example sentences show that multi-head self-attention captures salient items such as a piano or a home, as well as linguistic notions such as he or it.

D Experiments Details
We report not only the performance of models on test sets, but also the performance on validation sets and the standard deviation over 10 runs, as shown in Tables 9-10.
These results indicate that our model achieves state-of-the-art performance on both validation sets and test sets. Figure 7 shows the error analysis on TOEFL.

E Example of an identified structure

Figure 8 visualizes the identified structure from an essay whose score is low. We only present the identified structure due to licensing restrictions of TOEFL.

Figure 8: Example of the identified hierarchical discourse segments, where DS is a discourse segment and s is a sentence: an essay of low score whose essay-id is 1563434 in TOEFL.

F Most Preferred Centers

Table 11 shows the top-10 most preferred centers in TOEFL and in four articles in NYT.

Table 11: Top-10 most preferred centers (proportions) of essays submitted to the same prompt in TOEFL (see Appendix A for given topics) and four articles in NYT whose ids are 1458761, 1516415, 1705265, and 1254567, respectively. The titles of the NYT articles are as follows, 1458761: "Among 4 States, a Great Divide in Fortunes", 1516415: "One Cosmic Question, Too Many Answers", 1705265: "Which of These Foods Will Stop Cancer?", and 1254567: "Quantum Theory Tugged, And All of Physics Unraveled".