Analyzing Neural Discourse Coherence Models

In this work, we systematically investigate how well current models of coherence capture aspects of text implicated in discourse organisation. We devise two datasets of linguistic alterations that undermine coherence and test model sensitivity to changes in syntax and semantics. We furthermore probe the discourse embedding space and examine the knowledge encoded in representations of coherence. We hope this study provides further insight into how to frame the task and improve models of coherence assessment. Finally, we make our datasets publicly available as a resource for researchers to test discourse coherence models.


Introduction
Coherence refers to the properties of a text that indicate how meaningful (sub-)sentential constituents are connected to convey document-level meaning. Different theories have been proposed to describe the properties that contribute to discourse coherence, and some have been integrated with computational models for empirical evaluation. A popular approach is the entity-based model, which hypothesizes that coherence can be assessed in terms of the distribution of, and transitions between, entities in a text by constructing an entity-grid (Egrid) representation (Barzilay and Lapata, 2005, 2008), building on Centering Theory (Grosz et al., 1995). Subsequent work has adapted and further extended Egrid representations (Filippova and Strube, 2007; Burstein et al., 2010; Elsner and Charniak, 2011; Guinaudeau and Strube, 2013).

Other research has focused on syntactic patterns that co-occur in text (Louis and Nenkova, 2012) or semantic relatedness between sentences (Soricut and Marcu, 2006; Somasundaran et al., 2014) as key aspects of coherence modeling. There have also been attempts to model coherence by identifying rhetorical relations that connect textual units (Mann and Thompson, 1988; Lin et al., 2011; Feng et al., 2014) or by capturing topic shifts via Hidden Markov Models (HMMs; Barzilay and Lee, 2004). Other work has combined approaches to study whether they are complementary (Elsner et al., 2007; Feng et al., 2014).

More recently, neural networks have been used to model coherence. Some models utilize structured representations of text (e.g., Egrid representations; Tien Nguyen and Joty, 2017; Joty et al., 2018), while others operate on unstructured text, taking advantage of neural models' ability to learn useful representations for the task (Li and Jurafsky, 2017; Logeswaran et al., 2018; Farag and Yannakoudakis, 2019; Moon et al., 2019).
Coherence has typically been assessed via a model's ability to rank a well-organized document higher than its noisy counterparts created by corrupting sentence order in the original document (the binary discrimination task), and neural models have achieved remarkable accuracy on this task. Recent efforts have targeted additional tasks such as recovering the correct sentence order (Logeswaran et al., 2018; Cui et al., 2018), evaluating on realistic data (Lai and Tetreault, 2018; Farag and Yannakoudakis, 2019) and focusing on open-domain models of coherence (Li and Jurafsky, 2017; Xu et al., 2019). However, less attention has been paid to investigating and analyzing the properties of coherence that current models can capture, what knowledge is encoded in their representations, and how it might relate to aspects of coherence.
In this work, we systematically examine which properties of discourse coherence current coherence models can capture. We devise two datasets that exhibit various kinds of incoherence and analyze model ability to capture syntactic and semantic aspects of text implicated in discourse organisation. We furthermore investigate a set of probing tasks to better understand the information encoded in model representations and how it might relate to aspects of coherence. We hope this study provides further insight into how to frame the task and improve models of coherence assessment. Finally, we release our evaluation datasets as a resource for the community to test discourse coherence models. 1

Neural Coherence Models
We experiment with a number of existing and state-of-the-art neural approaches to coherence assessment that have publicly available implementations, and present details of the models below. Across all the BERT-based models, we use bert-large-uncased and layer 16, following Liu et al. (2019) and Hewitt and Manning (2019).

Multi-task learning (MTL; Farag and Yannakoudakis, 2019): The model applies a Bi-LSTM to input GloVe word embeddings (Pennington et al., 2014), followed by attention, to build sentence representations; a second Bi-LSTM with attention then composes a document vector. A linear operation followed by a sigmoid function is applied to the document representation to predict an overall coherence score as the main objective. Inspired by the Egrid approaches, the model is also optimized to predict the grammatical roles of the input words at the bottom layer of the network as an auxiliary task.

MTL with BERT embeddings (MTL bert): We replicate the previous MTL model but use BERT embeddings (Devlin et al., 2019) to initialize the input words.

Single-task learning (STL; Farag and Yannakoudakis, 2019): This model has the same architecture as MTL but only performs the coherence prediction task, excluding the grammatical role auxiliary objective.

STL with BERT (STL bert): The same as STL but with BERT embeddings.

Local Coherence Discriminator with language modeling (LCD rnnlm; Xu et al., 2019): The model generates sentence representations via an RNN language model, where word embeddings are initialized with GloVe. It then builds a representation for two consecutive sentences by concatenating the output of a set of transformations applied to the pair: concatenation, element-wise difference, element-wise product, and absolute value of the element-wise difference. This representation is fed to an MLP layer to predict a local coherence score. 2 The overall coherence of a document is the average of its local scores.

LCD with BERT (LCD bert): We create a variant of LCD rnnlm where, instead of an RNN language model encoder, we encode each sentence as the average of the BERT vectors of the words it contains. Everything else remains the same.

Local Coherence (LC; Li and Jurafsky, 2017): The model generates sentence vectors via an LSTM over GloVe-initialized word embeddings; a window approach is then applied over adjacent sentences to obtain embeddings of groups of sentences and predict local coherence scores. The final document score is the average of its local scores.

Egrid CNN (Egrid cnn; Tien Nguyen and Joty, 2017): The model applies a CNN over Egrid representations across groups of consecutive sentences; the CNN slides multiple filters over the grid to extract feature maps that represent high-level entity-transition features, followed by max pooling to focus on the most important features. Additional entity-related features are also integrated, such as salience, proper mentions and named entity type.
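To make the LCD scoring concrete, the following is a minimal PyTorch sketch of the local scoring head and the document-level averaging. The class name, hidden size and activation are our choices, not taken from the released implementations; LCD bert would simply supply sentence vectors computed as the mean of BERT word vectors.

    import torch
    import torch.nn as nn

    class LCDHead(nn.Module):
        """Sketch of LCD's local scorer: pairwise features -> MLP -> score.
        Names and sizes are illustrative, not the released implementation."""
        def __init__(self, dim, hidden=500):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(5 * dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1), nn.Sigmoid())

        def forward(self, s1, s2):
            # concatenation, difference, product, |difference| of the pair
            feats = torch.cat([s1, s2, s1 - s2, s1 * s2, (s1 - s2).abs()],
                              dim=-1)
            return self.mlp(feats)

    def document_score(sent_vecs, head):
        # overall coherence = average of local scores over consecutive pairs
        local = [head(sent_vecs[i], sent_vecs[i + 1])
                 for i in range(len(sent_vecs) - 1)]
        return torch.stack(local).mean()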

Binary Discrimination Task
Binary discrimination is the typical approach to assessing neural coherence models: a well-organized document should be ranked higher than its permuted counterparts created by corrupting sentence order. Following previous work, we train and test 3 the coherence models on the WSJ 4 and evaluate them using Pairwise Ranking Accuracy (PRA), calculated as the fraction of correct pairwise rankings between a coherent document and its incoherent counterparts.
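For reference, PRA can be computed as in the sketch below, where `score` stands for any of the document-level coherence scorers above; the function name is ours.

    def pairwise_ranking_accuracy(pairs, score):
        """PRA: fraction of (coherent, permuted) pairs in which the
        coherent document receives the higher coherence score."""
        correct = sum(score(doc) > score(perm) for doc, perm in pairs)
        return correct / len(pairs)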
In Table 1, we present the performance of all coherence models. The high accuracy of the models demonstrates their efficacy at selecting a maximally coherent sentence order from a set of candidate permutations. We note that the LCD and MTL BERT variants achieve a new state of the art on the WSJ. The remarkable accuracy on this task might suggest the problem is essentially solved. Herein, we seek to investigate how well these models of coherence capture aspects of text implicated in discourse organisation. We devise a set of datasets and systematically test model susceptibility to syntactic and semantic changes.

Cloze Coherence (CC) Dataset
We compile a large-scale dataset of coherent and incoherent examples, which we refer to as Cloze Coherence (CC): the former are intact, well-written texts, while the latter result from applying syntactic or semantic perturbations to the coherent ones.

Coherent examples
For the sake of specifically testing for coherence, we avoid complex linguistic structures. Specifically, we focus on coherent examples that consist of two short sentences that are coreferential and exhibit a rhetorical relation (such properties can be manipulated to create incoherent counterparts). Furthermore, we focus on examples that are self-contained, meaning that they do not reference or rely on an outer context to be interpreted. We find that narrative texts are good candidates to satisfy these criteria and therefore create our coherent examples from the ROCStories Cloze dataset 5 (Mostafazadeh et al., 2016).
ROCStories Cloze contains short stories of five sentences manifesting a sequence of causal or temporal events with a shared protagonist. A story usually starts by introducing a protagonist in the first sentence; subsequent sentences then describe events that happen to them in a logical, rhetorically plausible manner. The dataset was designed for commonsense reasoning, testing the ability of machine learning models to select a plausible ending for a story out of two alternatives. Here, our main aim is to challenge coherence models and investigate whether they truly understand intersentential relations and coherence-related features. We specifically utilize the first two sentences of each story to compose the coherent examples in our dataset. 6 Selecting the first two sentences helps make the examples self-contained, since there is no preceding context to refer to and no cataphoric relations to subsequent sentences.

Regarding rhetorical relations in these sentences, Mostafazadeh et al. (2016) conducted a temporal analysis of the logical order of the events presented in a story, demonstrating, among other things, that the first and second sentences are presented in a commonsensical temporal manner with logical links between them. To examine coreferential relations between the two sentences in each extracted pair, we gather a set of statistics. We adopt a heuristic approach, 7 simply counting the number of second sentences that contain at least one third person pronoun (personal or possessive), and find that they constitute 80% of the examples. 8 Third person pronouns anaphorically refer to preceding items in the text, which could occur in the same sentence or the previous one (i.e., the first sentence). We therefore randomly select, and manually inspect, 500 examples that contain third person pronouns in their second sentence, and find that in 95% of them the referenced entity appears in the first sentence. Furthermore, third person pronouns are not the only coreferential devices in the examples. For instance, we find that 90% of the second sentences contain a personal or possessive pronoun (whether first, second or third person), which could also signal coreference, e.g., 'I was walking to school. Since I wasn't looking at my feet I stepped on a rock.' There are also other coreferential devices, such as demonstrative references (e.g., 'this' and 'there'), 'the' + noun, proper names, and nominal substitutions (e.g., 'one' or 'ones'), to name a few (Halliday and Hasan, 1976), so the true proportion of coreferential pairs will be higher.

We use the same train/dev/test splits provided with ROCStories Cloze but only keep the first two sentences of each story. We exclude cases with erroneous sentence boundaries, 9 yielding 97,903 examples for training, 1,871 for development, and 1,871 for testing, with a training vocabulary of 29,596 tokens. Each instance in our dataset contains two sentences that represent a coherent pair.
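The counting heuristic can be sketched as follows, assuming simple whitespace tokenization over (first, second) sentence pairs; the pronoun list and function names are illustrative.

    THIRD_PERSON = {"he", "she", "it", "they", "him", "her", "them",
                    "his", "hers", "its", "their", "theirs"}

    def has_third_person_pronoun(sentence):
        # crude tokenization; strip surrounding punctuation before matching
        tokens = (tok.strip(".,!?;:'\"") for tok in sentence.lower().split())
        return any(tok in THIRD_PERSON for tok in tokens)

    def third_person_coverage(pairs):
        """Fraction of (first, second) pairs whose second sentence
        contains at least one third person pronoun."""
        return sum(has_third_person_pronoun(s2) for _, s2 in pairs) / len(pairs)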

Incoherent examples
To assess model susceptibility to syntactic or semantic alterations, we construct incoherent examples by applying two different transformations to each coherent pair, resulting in two datasets.

cloze swap. We create incoherent examples by swapping the two sentences in a coherent pair. This mostly breaks the coreference relation between them and/or the rhetorical relation (e.g., temporal or causal) by reversing the event sequence. The dataset, referred to as cloze swap, is balanced: the number of incoherent examples equals the number of coherent ones above. The way cloze swap is created corrupts the syntactic patterns that co-occur in coherent texts (e.g., S → NP-SBJ VP | NP-SBJ → PRP), as demonstrated by Louis and Nenkova (2012).

cloze rand. Here we create incoherent examples by keeping the first sentence of a coherent pair intact and replacing the second with a randomly selected second sentence from (the same split of) our set of coherent examples. This dataset, referred to as cloze rand, is also balanced (for each coherent pair, we compose one incoherent counterpart), and comprises examples with changed semantics but with the main syntactic pattern intact. As a randomly created pair may still be coherent, we address this by: 1) constraining the random selection of the second sentence to not begin with the same word as the second sentence in the original pair, or with the pronoun 'he' if the original starts with 'she', and vice versa 10 (we note that 70% of the second sentences in ROCStories Cloze start with a pronoun); 2) using human evaluation to further assess the validity of this data and obtain an estimate of upper-bound performance on the task. Specifically, we randomly select 100 coherent sentence pairs from our test split, along with their incoherent counterparts, and ask two annotators (who are not authors of this paper), with high English proficiency, to rank each set of coherent-incoherent examples according to which they consider more coherent and plausible. The average PRA of the annotators is 94.5%.

Table 3 shows examples from cloze swap and cloze rand. As our datasets are balanced (one incoherent counterpart per coherent pair), we have a total of 195,806, 3,742 and 3,742 instances in the train, dev and test splits respectively for each cloze dataset (cloze swap and cloze rand share the same coherent examples and have the same number of coherent and incoherent examples).
We note that the gold labels in this data are not to be interpreted as (overall) binary indicators of coherence. We rather use these to test model performance using PRA, i.e. we only compare a coherent pair with its own incoherent counterpart.
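The first constraint on cloze rand sampling can be sketched as below; the helper name and the naive first-word check are ours, and we assume the candidate pool is varied enough for the rejection loop to terminate.

    import random

    HE_SHE = {"he": "she", "she": "he"}

    def sample_second_sentence(orig_second, pool, rng=random):
        """Draw a replacement second sentence that does not start with the
        original's first word, treating 'he' and 'she' as interchangeable."""
        first = orig_second.split()[0].lower()
        banned = {first, HE_SHE.get(first, first)}
        while True:
            cand = rng.choice(pool)
            if cand.split()[0].lower() not in banned:
                return cand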

Coherent: Tyrese joined a new gym. The membership allows him to work out for a year.
cloze swap: The membership allows him to work out for a year. Tyrese joined a new gym.
cloze rand: Tyrese joined a new gym. As children they hated being dressed alike.

Coherent: Jasmine doesn't know how to play the guitar. She asked her dad to take her to guitar class.
cloze swap: She asked her dad to take her to guitar class. Jasmine doesn't know how to play the guitar.
cloze rand: Jasmine doesn't know how to play the guitar. May thought her milk was no good.

Coherent: I wanted to play an old game one day. When I looked in the game's case the CD was missing.
cloze swap: When I looked in the game's case the CD was missing. I wanted to play an old game one day.
cloze rand: I wanted to play an old game one day. Jason pressed the buzzer since he knew the answer.

Table 3: Examples of coherent and incoherent pairs from the cloze swap and cloze rand datasets.


Controlled Linguistic Alterations (CLA) Dataset

In order to further understand the properties of coherence that current coherence models capture, we manually construct a dataset of controlled sets of linguistic changes. We first identify a set of coherent, well-written texts of two consecutive sentences from business and financial articles in the BBC, the Independent and the Financial Times (this allows us to stay in the same domain as the one used for training the models - the WSJ). We focus on sentence pairs where the subject of the first sentence is pronominalized in the second, and the second sentence begins with this pronoun.

Original: A government paper on Monday found UK and EU firms would be faced with "a significant new and ongoing administrative burden" in the event of a no-deal Brexit. It found large firms importing and exporting at scale would need to fill in forms taking one hour 45 minutes on average and cost £28 per form for each load imported.

Swap: It found large firms importing and exporting at scale would need to fill in forms taking one hour 45 minutes on average and cost £28 per form for each load imported. A government paper on Monday found UK and EU firms would be faced with "a significant new and ongoing administrative burden" in the event of a no-deal Brexit.

Random: 1- A government paper on Monday found UK and EU firms would be faced with "a significant new and ongoing administrative burden" in the event of a no-deal Brexit. She spent over a decade at Swiss investment bank UBS before joining the UK Treasury's council of economic advisers in 1999. 2- Lady Vadera was born in Uganda and moved to the UK as a teenager. It found large firms importing and exporting at scale would need to fill in forms taking one hour 45 minutes on average and cost £28 per form for each load imported.

Lexical Substitution: The paper found large firms importing and exporting at scale would need to fill in forms taking one hour 45 minutes on average and cost £28 per form for each load imported. A government paper on Monday found UK and EU firms would be faced with "a significant new and ongoing administrative burden" in the event of a no-deal Brexit.

Prefix Insertion: More specifically, it found large firms importing and exporting at scale would need to fill in forms taking one hour 45 minutes on average and cost £28 per form for each load imported. A government paper on Monday found UK and EU firms would be faced with "a significant new and ongoing administrative burden" in the event of a no-deal Brexit.

Lexical Perturbation: A government paper on Monday found UK and EU firms would be faced with "a significant new and ongoing administrative burden" in the event of a no-deal Brexit. It found large firms importing and exporting at scale would need to fill in cups taking one hour 45 minutes on average and cost £28 per cup for each load imported.

Corrupt Pronoun: A government paper on Monday found UK and EU firms would be faced with "a significant new and ongoing administrative burden" in the event of a no-deal Brexit. He found large firms importing and exporting at scale would need to fill in forms taking one hour 45 minutes on average and cost £28 per form for each load imported.

Table 4: Examples from our manually constructed CLA dataset. For 'Random' we create two incoherent instances: one where the first sentence is unchanged and the second is randomly selected (1-); and another where the first sentence is randomly selected and the second is kept intact (2-).

We select the examples so that they are self-contained and do not reference an outer context. We then manually create incoherent counterparts by modifying the coherent examples in a constrained way in order to systematically examine model performance. Specifically, we apply the following sets of perturbations to our set of coherent sentence pairs; examples of each are presented in Table 4.

Swap. We simply swap the two sentences.
Random. We keep the first sentence intact and select a second sentence randomly from our set of coherent examples. We constrain the selection so that the subject pronoun is different from the subject pronoun in the original sentence, taking into account that some subjects could be referred to by 'he', 'she' or 'they' and factoring that into the selection. We also create another random pair with the same constraint, but now changing the first sentence. Thus each original coherent example has two incoherent counterparts.

Lexical Substitution. We swap the two sentences in a coherent pair but replace the subject pronoun in the second sentence with 'the' + a general noun that substitutes for the subject in the first sentence (e.g., the company, the woman, etc.).

Prefix Insertion. We analyze the WSJ training data and find that the average number of times the first sentence in a document starts with a pronoun is 0.02 (and never with 'he' or 'she'), which is significantly less than the average number of times a sentence starts with a pronoun regardless of its position (0.07). This difference is not maintained in the randomly ordered documents in the WSJ training set, and so it might give the models a signal for detecting that a swapped pair starting with a pronoun is less coherent. To see whether such positional information plays a role in model predictions, after swapping the sentences we insert a phrase before the subject pronoun that does not change the propositional content (e.g., 'More specifically', 'However', etc.). We can then observe whether this insertion changes the prediction of the model.

Lexical Perturbation. We investigate the robustness of the models to minor lexical changes that result in incoherent meaning by replacing one word in either of the two sentences (if the word is repeated, we change every occurrence). We choose a replacement word from the training vocabulary of the WSJ with the same part-of-speech tag (see the sketch after this list). For example, in Table 4, 'form' is replaced with 'cup' and 'forms' with 'cups'.
Corrupt Pronoun. We replace the subject pronoun in the second sentence with another pronoun that cannot reference anything in the first sentence. With this method, we test whether the models are capable of resolving coreference or merely rely on syntactic patterns.
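The lexical perturbation procedure can be sketched as below, assuming NLTK's tokenizer and POS tagger (with the relevant models downloaded); the helper names are hypothetical, and the actual replacements in the dataset were selected manually from the WSJ training vocabulary.

    import random
    from collections import defaultdict
    import nltk  # assumes the 'punkt' and POS tagger models are downloaded

    def index_vocab_by_pos(tagged_vocab):
        """Group a (word, POS-tag) vocabulary by tag."""
        by_pos = defaultdict(set)
        for word, tag in tagged_vocab:
            by_pos[tag].add(word.lower())
        return by_pos

    def perturb(sentence, target, by_pos, rng=random):
        """Replace every occurrence of `target` with a random same-POS word.
        Assumes `target` occurs in the sentence; detokenization is naive."""
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        tag = next(t for w, t in tagged if w.lower() == target)
        new = rng.choice(sorted(by_pos[tag] - {target}))
        return " ".join(new if w.lower() == target else w for w, _ in tagged)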
Our dataset contains a total of 240 examples of coherent and incoherent pairs of sentences (30 coherent examples and 210 incoherent counterparts). Our constrained set of modifications ensures that all coherent examples are more coherent than any of the incoherent counterparts in the data.

Experiments
Table 5 (top) presents the PRA performance of the models trained on the WSJ (Section 3) when evaluated on the test sets of the CC datasets (rows 'cloze swap' and 'cloze rand'). We find that, overall, models are good at detecting syntactic alterations (cloze swap; PRA ranging from 69.3 to 84.6), even though the test data comes from a different domain than the training data. However, most models perform poorly on semantic alterations (cloze rand; PRA ranging from 48.5 to 54.5), the only exception being LCD bert, which achieves a PRA of 71. Specifically, models that use RNN-based sentence encoders (the first six models), even when initialised with BERT, or that apply a CNN to capture entity transitions, fall short in capturing semantic changes, despite the fact that cloze rand is from the same domain as cloze swap. In contrast, LCD bert, which builds sentence representations by averaging BERT vectors and then applies a set of linear transformations to increase its expressive power, is more capable of detecting semantic changes, surpassing its RNN-based counterpart (LCD rnnlm) by 16.5% on cloze rand. Additionally, across models, we observe that the use of contextualized (BERT) embeddings consistently improves performance on both cloze tasks, although performance on semantic alterations remains close to random.
We investigate domain shift effects by fine-tuning the WSJ-trained models on each of the cloze swap and cloze rand training sets (Section 4) and re-evaluating performance on the respective test sets. Specifically, we use an MLP layer over the models' pre-prediction representation, followed by a sigmoid non-linearity. The models are optimized using the mean squared error between the gold labels (0 or 1) and the predicted scores. 12 In this setup, only the MLP layer is fine-tuned, not the whole coherence model, which allows us to create a fast, efficient evaluation framework that can be applied as a further examination step after coherence models have been developed and tuned on their respective datasets, instead of training the models from scratch. The results of the fine-tuned models are presented in Table 5 (CC; rows 'fine-tuned'). Although we can see some domain effect, we nevertheless find that the results confirm our earlier observation: performance on semantic alterations (cloze rand) is, overall, poor, in contrast to syntactic ones (cloze swap).

12 We use Adam (Kingma and Ba, 2015), batch size 64, and L2 regularization with the penalty rate tuned over the search space {0.00001, 0.0001, 0.001, 0.01}. We use early stopping: training stops if PRA does not improve on the dev set over 5 epochs (max epochs 200). The MLP hidden unit size is 100.
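A minimal sketch of this fine-tuning setup follows, assuming the representations are precomputed from the frozen coherence model. The hidden size of 100 follows footnote 12; the class name and the hidden activation are our choices.

    import torch
    import torch.nn as nn

    class FineTuneHead(nn.Module):
        """MLP + sigmoid over a frozen coherence model's representation."""
        def __init__(self, rep_dim, hidden=100):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(rep_dim, hidden), nn.Tanh(),
                                     nn.Linear(hidden, 1), nn.Sigmoid())

        def forward(self, rep):
            return self.mlp(rep).squeeze(-1)

    def train_step(head, optimizer, reps, labels):
        # reps: representations precomputed under torch.no_grad();
        # labels: gold 0/1 coherence labels
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(head(reps), labels)
        loss.backward()
        optimizer.step()
        return loss.item()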
In Table 5 (bottom), we can observe model performance (PRA) on our constrained set of manually devised examples (CLA). Again, we observe a similar result: across RNN-based models, performance is particularly low on Random examples, which suggests that they struggle to detect topical or rhetorical shifts and unresolved references when the main syntactic pattern is maintained. The exception is LCD bert, which is again the best performing model (PRA 78.3).
We furthermore observe that Egrid cnn is the second best model on CLA Random (PRA 71.6). A sparser entity grid, where the entities in the two sentences differ, allows the model to detect such cases (e.g., in the example in Table 4, 'firms' is mentioned in both sentences, while in the two Random examples it is only mentioned in one). However, the substantial difference between its PRA on CLA Random and on cloze rand (53.4) suggests that the lower performance on the latter is due to domain shift effects, something we do not observe (to the same extent) with LCD bert. Regarding the CLA Swap results, we can again confirm the models' capability of detecting corrupted syntactic constructions. We furthermore observe that they maintain good performance in cases where a prefix is inserted ('Prefix Insertion') or the subject pronoun is substituted with a lexical item ('Lexical Substitution'). This suggests that they capture the relevant syntactic patterns and do not rely solely on positional features.
Performance is overall low on Lexical Perturbation and Corrupt Pronoun, which suggests that the models are not sensitive to minor lexical changes, even when these result in implausible meaning, and also struggle to resolve pronominal references.
However, the exception is LCD bert (with a PRA of 80 on Lexical Perturbation and 76.6 on Corrupt Pronoun), suggesting a better ability to capture semantics and resolve references. Across all six CLA alterations ('All data'; Table 5), we find that, overall, LCD bert is the top performing model (average PRA). The 'All data' row reports the result of comparing a coherent example against its incoherent counterparts across the different alterations (i.e., in Table 4, the original example is compared against all the other examples in the table, and this is applied to all the original examples in the dataset). If we furthermore compare all the coherent examples against all the incoherent ones in the whole dataset (rather than against their own incoherent counterparts only), we find that a similar performance pattern is maintained (row 'All data (TPRA)', i.e., Total Pairwise Ranking Accuracy).
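TPRA can be sketched analogously to PRA; the function name is ours.

    def total_pairwise_ranking_accuracy(coherent, incoherent, score):
        """TPRA: every coherent example is compared against every
        incoherent example in the dataset, not just its own counterparts."""
        wins = sum(score(c) > score(i) for c in coherent for i in incoherent)
        return wins / (len(coherent) * len(incoherent))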

Probing Coherence Embedding Space
Inspired by previous work (Conneau et al., 2018), and to better understand the information that is encoded in the representations of coherence models, we investigate probing tasks that can capture coherence-related features.
We experiment with the following set of sentence-level tasks relevant to discourse coherence: 1) the subject number (SubjNum) task, which detects the number of the subject of the main clause; 2) the object number (ObjNum) task, which detects the number of the direct object of the main clause; 3) the coordination inversion (CoordInv) task, which contains sentences consisting of two coordinate clauses, where the clauses are inverted in half of the sentences and kept intact in the other half (the task is to detect whether a sentence has been modified); 4) the corrupt agreement (CorruptAgr) task, where sentences are corrupted by inverting the verb number (the task is to identify corrupted sentences).
Tasks 1, 2 and 4 align with Centering Theory, as they probe for subject- and object-relevant information; the theory suggests that subject and object roles are indicators of entity salience. Task 3, on the other hand, tests whether the models can capture intra-sentential coherence. For these tasks, we use the datasets from Conneau et al. (2018). Our probing model consists of an MLP layer over the models' sentence representations, followed by a sigmoid non-linearity. We use the same training parameters as Conneau et al. (2018). 13

Results. Table 6 presents the results. 14 Overall, we observe that models are better at detecting SubjNum and ObjNum (accuracy of at least 61% for all models except LC, which is the odd one out) than CorruptAgr and CoordInv, with the last two being particularly challenging for most models (minimum accuracy of 53%, excluding LC). For SubjNum and ObjNum, the models can find hints in words other than the target word, as the majority of nouns in a sentence tend to have the same number: 75.9% of SubjNum test sentences and 78.7% of ObjNum ones contain only nouns of the same number (Conneau et al., 2018). CorruptAgr examples, on the other hand, are longer, exhibit more syntactic variation, and require the models to detect the dependency between verbs and their subjects. CoordInv is also a difficult task for the models, particularly since they are pre-trained on the WSJ to focus on the order of sentences, not clauses.
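Probe accuracy over fixed representations can be computed as in the sketch below; the function name is ours, and `probe` stands for a trained MLP head like the one sketched earlier.

    import torch

    def probe_accuracy(probe, reps, labels):
        """Binary accuracy of a trained probing classifier over fixed
        sentence representations (0.5 decision threshold)."""
        with torch.no_grad():
            preds = (probe(reps) > 0.5).float()
        return (preds == labels).float().mean().item()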
Across all tasks, we find that LCD bert achieves the best performance, outperforming all other approaches. We note, however, that LCD bert does not fine-tune its sentence representations during coherence training on the WSJ; they are fixed and based on the average of BERT word embeddings (Section 2). This means the probing model fine-tunes averaged BERT word embeddings rather than sentence parameters learned by the LCD coherence model. Therefore, the level of performance observed is not representative of the maximum performance coherence models can achieve on these tasks. 15 We surmise that the comparatively lower performance observed with MTL bert and STL bert (whose sentence representations are fine-tuned during coherence training) is due to their coherence training objective. The models are optimized on the binary discrimination task, i.e., learning to rank a well-organized document higher than its permuted counterparts. This is an overly simplistic approach to coherence modeling that may make models (and their representations) more susceptible to losing useful linguistic information. That said, MTL bert, which has a direct training signal with respect to the words' grammatical roles, is able to alleviate this issue to an extent and is the next best performing model on SubjNum, ObjNum and CorruptAgr.
Across tasks, LC is the odd one out and the worst performing model. This can be explained partly by its comparatively lower performance on the simpler binary discrimination task (Table 1) and partly by the simplicity of the approach: LC utilizes no attention mechanism, unlike the MTL and STL family of models, nor transformations as expressive as those of LCD rnnlm.

15 Nevertheless, we observe that LCD bert outperforms the best result on CoordInv reported by Conneau et al. (2018).

Discussion
Our evaluation experiments on two coherence datasets reveal that RNN- or Egrid-based coherence models are able to detect syntactic alterations that undermine coherence, but are less effective at detecting semantic ones, even after fine-tuning on the latter. We furthermore find that they particularly struggle with recognizing minor lexical changes, even when these result in implausible meaning, and with resolving pronominal references. On the other hand, these models are particularly good at detecting cases where a prefix is inserted or the subject pronoun is substituted with a lexical item, suggesting that they capture the relevant syntactic patterns and do not rely solely on positional features. We find that the best performing model overall is LCD bert, which does not use an RNN sentence encoder but rather builds sentence representations by averaging BERT embeddings, then utilizes a number of linear transformations over adjacent sentences to facilitate learning richer representations.
Our probing experiments reveal that models are better at encoding information about subject and object number, followed by verb number (CorruptAgr). These probing tasks align with Centering Theory, as they probe for subject- and object-relevant information. The task that tests for knowledge of coordination inversion is the lowest performing one overall, suggesting that the models have limited capacity for capturing information related to intra-sentential coherence. Excluding LCD bert, MTL bert is the best performing model; nevertheless, there is still scope for substantial improvement across all probing tasks, particularly on CoordInv and CorruptAgr.

Conclusion
We systematically studied how well current models of coherence capture aspects of text implicated in discourse organisation. We devised datasets exhibiting various kinds of incoherence and examined model susceptibility to syntactic and semantic alterations. Our results demonstrate that the models reliably detect corrupted syntactic patterns, prefix insertions and lexical substitutions. However, they fall short in capturing rhetorical and semantic corruptions, lexical perturbations and corrupt pronouns. We furthermore find that the discourse embedding space encodes subject- and object-relevant information; however, there is scope for substantial improvement in encoding linguistic properties relevant to discourse coherence. Experiments on coordination inversion further suggest that current models have limited capacity for encoding information related to intra-sentential coherence.
We hope this study provides further insight into how to frame the task of coherence modeling and improve model performance. Finally, we make our datasets publicly available for researchers to use in testing coherence models.