Incremental Neural Lexical Coherence Modeling

Pretrained language models, neural models pretrained on massive amounts of data, have established the state of the art in a range of NLP tasks. They are based on the Transformer, which relates all items in a sequence simultaneously to capture semantic relations. However, this differs from how humans read: humans process sentences one by one, incrementally. Can neural models benefit from interpreting texts incrementally as humans do? We investigate this question in coherence modeling. We propose a coherence model which interprets sentences incrementally to capture lexical relations between them. In two downstream tasks, we compare our model with the state of the art in each task and with simple neural models relying on a pretrained language model. Our findings suggest that interpreting texts incrementally, as humans do, could be useful for designing more advanced models.


Introduction
Coherence describes the semantic relation between elements of a text. It distinguishes a text as either a unified whole or a collection of unrelated sentences. Lexical coherence represents the cohesive effect achieved by lexical relations (Halliday and Hasan, 1976).
Earlier work mainly focuses on capturing lexical relations using external resources (Morris and Hirst, 1991). Mesgar and Strube (2016) introduce a graph model, the latest model for lexical coherence, to represent lexical relations between sentences on the graph. It encodes sentences as nodes and lexical relations between sentences as edges. This model, nevertheless, considers lexical items independently.
Recent neural models (Liu and Lapata, 2019; Gupta and Durrett, 2019) adopt a modern machine-learning technique, the Transformer (Vaswani et al., 2017). It relates all items simultaneously to capture semantic relations in sequences. More recently, large-scale pretrained language models, Transformer-based models pretrained on massive amounts of text, have led to significant improvements in a range of NLP tasks (Devlin et al., 2019).
However, the Transformer processes texts differently from the way humans do. Psycholinguistic experiments show that humans read texts incrementally (Marslen-Wilson, 1975; Kamide et al., 2003; Gibson and Warren, 2004). Köhn (2018) claims that NLP systems which follow this theory should interpret texts incrementally, too. Do neural models benefit from both pretrained language models and incremental sentence processing?
To investigate this question, we propose a coherence model which interprets sentences incrementally to capture lexical relations. For the sentence currently being read, our model first computes a semantic centroid vector, the average of the representations of the preceding sentences. The model then measures the semantic similarity between the centroid vector and the current sentence. It repeats this procedure for all sentences.
[Algorithm 1: incremental lexical coherence procedure; it returns centroid_last and dist_vec.]
We evaluate our model on two tasks: assessing discourse coherence and automated essay scoring. We compare our model with the state of the art in each task and with two variants of a simple baseline relying on a pretrained language model: the first encodes sentences individually, and the second encodes a whole text at once.

Related Work
Morris and Hirst (1991) propose lexical chains, which identify sequences of related words using a lexical knowledge base. To identify lexical relations without human annotation, generative models that learn lexical distributions have been developed. However, they may not generalize well across multiple datasets drawn from different distributions (Eisenstein and Barzilay, 2008; McNamara et al., 2010). Mesgar and Strube (2016) propose a graph-based model to overcome these limitations using word embeddings pretrained on a large-scale dataset. Their graph encodes sentences as nodes and lexical relations between sentences as edges. The model extracts k-node subgraphs of this graph and represents coherence patterns by the frequency of these subgraphs. However, it neglects context when capturing lexical relations.
Modeling lexical coherence has proven to be effective in diverse NLP applications like summarization (Erkan and Radev, 2004), translation (Xiong et al., 2013), and discourse parsing (Jia et al., 2018). We believe that our study of lexical coherence can be beneficial in these applications.

Our Model
Figure 1 shows our model architecture. Our model first obtains sentence representations from a pretrained language model. It then feeds these representations into the lexical coherence module to produce the semantic centroid vector and the semantic similarity vector. We concatenate the two vectors and pass them through a feed-forward network to generate the model output.
Sentence representations: We first encode input sentences using a pretrained language model to produce word representations. We take a sentence representation as the average of all word representations in a sentence. We then feed the sentence representations to the lexical coherence module.
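As a rough illustration of the averaging step, the following sketch mean-pools word vectors into a single sentence vector. The function name mean_pool and the toy two-dimensional vectors are assumptions made for illustration; in the model, the word vectors would come from the pretrained language model.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(word_vectors: Sequence[Sequence[float]]) -> Vector:
    """Average word vectors dimension-wise into one sentence vector.

    Hypothetical helper: a stand-in for averaging the word
    representations produced by a pretrained language model.
    """
    if not word_vectors:
        raise ValueError("a sentence needs at least one word vector")
    dim = len(word_vectors[0])
    if any(len(v) != dim for v in word_vectors):
        raise ValueError("all word vectors must have the same dimension")
    count = len(word_vectors)
    return [sum(v[d] for v in word_vectors) / count for d in range(dim)]


# Toy sentence with two words, each a 2-dimensional vector.
sentence_vector = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```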
Incremental processing module: Algorithm 1 describes our lexical coherence module. To interpret the sentence being read, we update two components: a semantic centroid vector and a semantic similarity vector. The semantic centroid vector takes averaged representations of preceding sentences, and this vector represents their central point. We then measure the semantic similarity between the current sentence representation and the centroid vector. We use cosine similarity to measure semantic similarity. We iterate this procedure for all sentences.
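The incremental procedure described above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: we assume that the first sentence initializes the centroid, that similarities are computed from the second sentence onward, and that the centroid is updated with the current sentence after each comparison. The function names are ours.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def incremental_coherence(
    sentences: Sequence[Sequence[float]],
) -> Tuple[Vector, List[float]]:
    """Read sentence vectors one by one.

    Returns the centroid after the last sentence and the list of
    similarities between each sentence and the centroid of the
    sentences that precede it.
    """
    if not sentences:
        raise ValueError("need at least one sentence vector")
    # Assumption: the first sentence initializes the centroid.
    centroid = [float(x) for x in sentences[0]]
    similarities: List[float] = []
    for seen, sentence in enumerate(sentences[1:], start=1):
        # Compare the current sentence with the centroid of preceding ones.
        similarities.append(cosine_similarity(centroid, sentence))
        # Running mean: fold the current sentence into the centroid.
        centroid = [(c * seen + x) / (seen + 1) for c, x in zip(centroid, sentence)]
    return centroid, similarities


centroid_last, dist_vec = incremental_coherence([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

In this toy input the third sentence is orthogonal to the first two, so its similarity to the centroid drops to zero, which is the kind of change the similarity vector is meant to expose.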
A convolutional layer is applied to the semantic similarity vector to extract a feature map which represents the patterns of changes in semantic similarities. Max-pooling is applied to the feature map, and this lets the model capture features semantically relevant to the centroid vector.
Document representation: We concatenate the semantic centroid vector, updated on the last sentence, and the semantic similarity vector. Finally, a feed-forward network is applied on the representation to produce the output value.
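The concatenation step can be illustrated with a small sketch. We assume here that the similarity vector has already been reduced to pooled features and that the feed-forward network is a single linear layer; the weights, bias, and function names are hypothetical placeholders, not learned parameters from the paper.

```python
from typing import List, Sequence


def linear(
    x: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """One linear layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def document_logits(
    centroid_last: Sequence[float],
    similarity_features: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Concatenate both vectors and apply a (hypothetical) linear layer."""
    features = list(centroid_last) + list(similarity_features)
    if any(len(row) != len(features) for row in weights):
        raise ValueError("each weight row must match the concatenated length")
    return linear(features, weights, bias)


# Toy example: 2-dim centroid, 1 pooled similarity feature, 2 output classes.
logits = document_logits(
    [1.0, 2.0], [0.5], weights=[[1.0, 0.0, 0.0], [0.0, 1.0, 2.0]], bias=[0.0, 1.0]
)
```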

Implementation Details
We implement our model using the PyTorch library and use the Stanford Stanza library for sentence tokenization. For the baselines that do not use the pretrained language model, we use GloVe for word embeddings, the pretrained word embeddings trained on Google News (Pennington et al., 2014). For our model, we apply a convolutional layer with kernel size 3, stride 2, and padding 2, followed by an adaptive max-pooling layer that reduces a vector to length 5 (see the supplementary material for more details).
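To make the effect of these hyperparameters concrete, the sketch below computes the output length of a one-dimensional convolution and applies adaptive max-pooling to a toy vector. It is written in plain Python rather than PyTorch so that it stays self-contained. The binning rule (window i spans floor(i*n/out) to ceil((i+1)*n/out)) is our assumption, modeled on the usual definition of adaptive pooling, and the function names are ours.

```python
from typing import List, Sequence


def conv1d_output_length(
    length: int, kernel_size: int = 3, stride: int = 2, padding: int = 2
) -> int:
    """Standard output length of a 1-D convolution (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1


def adaptive_max_pool1d(values: Sequence[float], output_size: int = 5) -> List[float]:
    """Reduce a vector of any positive length to output_size maxima."""
    n = len(values)
    if n == 0:
        raise ValueError("cannot pool an empty vector")
    pooled = []
    for i in range(output_size):
        start = (i * n) // output_size
        end = ((i + 1) * n + output_size - 1) // output_size  # ceiling division
        pooled.append(max(values[start:end]))
    return pooled


# A similarity vector of length 10 yields a feature map of length 6.
feature_map_length = conv1d_output_length(10)
# Pooling a toy feature map of length 10 down to 5 values.
pooled = adaptive_max_pool1d([1, 5, 2, 8, 3, 7, 4, 6, 0, 9], output_size=5)
```

Because the pooling is adaptive, texts with different numbers of sentences end up with a fixed-length feature vector, which is what allows the concatenation with the centroid vector downstream.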
Many pretrained language models cannot encode long texts due to their training settings, or require a massive amount of memory to encode them. In this work, we employ XLNet as the pretrained language model. Unlike BERT (Devlin et al., 2019), XLNet can handle input sequences of any length, which our datasets require in order to encode a whole text at once.
We report results as the mean over 10 cross-validation runs with different random seeds. We validate statistical significance with a one-sample t-test with p-value < 0.01. Each run uses 23 GB of GPU memory on an NVIDIA P40.
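The reporting procedure can be sketched as follows. The reference value against which the runs are tested is not stated here, so the reference argument below is a hypothetical placeholder (for instance, the accuracy of a compared system), and the scores are toy numbers rather than results from the paper. The sketch only computes the t statistic; turning it into a p-value requires the Student t distribution with n - 1 degrees of freedom, which the Python standard library does not provide.

```python
import math
import statistics
from typing import Sequence


def one_sample_t_statistic(scores: Sequence[float], reference: float) -> float:
    """t = (mean - reference) / (sample_std / sqrt(n))."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1)
    if std == 0.0:
        raise ValueError("all runs are identical; t is undefined")
    return (mean - reference) / (std / math.sqrt(n))


# Toy accuracies from three hypothetical runs, tested against 0.5.
toy_scores = [0.5, 0.6, 0.7]
mean_accuracy = statistics.mean(toy_scores)
t_value = one_sample_t_statistic(toy_scores, reference=0.5)
```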

Simple Baselines relying on a pretrained language model
To investigate the influence of a pretrained language model on the tasks, we present two simple baselines relying on the pretrained language model. The first model encodes an input document at the sentence level and averages the encoded representations (Averaged-XLNet-Sent). The second model has the same architecture but it encodes an input document at the document level at once (Averaged-XLNet-Doc). We compare these baselines with other models for both tasks.

Task 1: Assessing Discourse Coherence
Dataset: We first evaluate our model on the Grammarly Corpus of Discourse Coherence (GCDC) dataset (Lai and Tetreault, 2018). While previous work evaluates coherence models on formal texts (Barzilay and Lapata, 2008), GCDC is designed to evaluate coherence models on informal texts, such as emails or online reviews. The dataset contains four domains: Clinton and Enron for emails, Yahoo for questions and answers in an online forum, and Yelp for online reviews of businesses. The quality of the dataset is controlled to have evenly distributed scores and a low correlation between discourse length and scores.
Table 2: Accuracy on TOEFL (*: our re-implementation).
Experimental setup: For GCDC, we perform the experiments following previous work (Lai and Tetreault, 2018). We perform 10-fold cross-validation, use the same evaluation measure, accuracy for 3-class classification, and use the same loss function, cross-entropy loss.
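The evaluation protocol can be illustrated with a small sketch of k-fold splitting and accuracy. This is only a schematic, assuming contiguous folds without shuffling; the function names are ours, and the labels are toy values rather than GCDC data.

```python
from typing import List, Sequence, Tuple


def k_fold_indices(n_items: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split item indices into k (train, test) pairs with contiguous test folds."""
    if not 1 < k <= n_items:
        raise ValueError("k must satisfy 1 < k <= n_items")
    # Distribute the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


def accuracy(gold: Sequence[int], predicted: Sequence[int]) -> float:
    """Fraction of items whose predicted class equals the gold class."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


folds = k_fold_indices(n_items=10, k=5)
toy_accuracy = accuracy([0, 1, 2, 1], [0, 1, 1, 1])
```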
Baseline models: Barzilay and Lapata (2008) propose the entity grid, based on Centering Theory (Grosz et al., 1995). This model considers the distribution of entities over sentences. Guinaudeau and Strube (2013) convert the supervised entity grid into an unsupervised graph-based model. Li and Jurafsky (2017) propose a neural model which uses cliques, sets of adjacent sentences, to distinguish sentences extracted from original articles from randomly permuted ones. Mesgar and Strube (2018) propose a neural coherence model which finds the two most similar RNN outputs to determine the most salient parts of sentences that connect adjacent sentences. Lai and Tetreault (2018) show that a simple neural model which uses paragraph information outperforms previous coherence models on GCDC.
Results: Table 1 shows the performance of coherence models on GCDC. The first baseline outperforms the previous models. Our model, which encodes sentences individually using the pretrained language model and interprets sentences incrementally, outperforms the first baseline. However, the second baseline, which, unlike humans, encodes a whole text at once, outperforms our model. We suspect that the characteristics of GCDC lead to this. Lai and Tetreault (2018) observe that many texts with low coherence are not well organized and exhibit unexpected topic switches more often than other texts. The texts in GCDC mostly consist of several sentences, and the model might distinguish these cases well on relatively short sequences. To investigate this further, we next compare models on TOEFL, where texts are written in an academic style.

Task 2: Automated Essay Scoring
Dataset: To examine the effectiveness of our model in a downstream task with formal texts, we evaluate our model on the Test of English as a Foreign Language (TOEFL) dataset. TOEFL has an overall higher quality of essays compared to essays in a standard dataset for AES, the Automated Student Assessment Prize (ASAP) dataset. The essays in ASAP are written by students in grade levels 7 to 10 of US middle schools, whereas the essays in TOEFL are submitted by non-native students for the standard English test for entrance to US universities. The prompts in TOEFL do not vary much, the student population is more controlled, and the essays have a similar length.
Experimental setup: We evaluate performance in-domain at the prompt level. We perform 5-fold cross-validation. For 3-class classification, we use cross-entropy loss to train models and measure accuracy to evaluate models. We evaluate performance on the validation set over 30 epochs. Following previous work on AES (Taghipour and Ng, 2016), the model which reaches the best performance on the validation set is then applied to the test set (see the supplementary material for details).
Baseline models: Dong et al. (2017) introduce a model which consists of a convolutional layer, followed by a recurrent layer and an attention layer (Bahdanau et al., 2015). We also compare with the state of the art on TOEFL, Nadeem et al. (2019). Inspired by Dong et al. (2017), Nadeem et al. (2019) propose a model which uses an attention layer to automatically decide the relative weights of adjacent words as well as sentences. However, we notice that Nadeem et al. (2019) evaluate their model in a different experimental setup. They filter out content with sentences longer than 40 words or documents longer than 25 sentences; they also evaluate performance without cross-validation. To ensure a fair comparison, we changed the experimental setup in their implementation to match ours. Mesgar and Strube (2018) evaluate their coherence model on the AES task as well as the task of assessing readability.
Results: Table 2 summarizes the performance of models on TOEFL. The first baseline outperforms the previous models, and the second baseline also outperforms them. Our model sets a new state of the art on this dataset. Texts in TOEFL are better organized than those in GCDC, where the second baseline outperforms our model. We suspect that the pretrained language model captures some patterns on long sequences to predict scores, rather than capturing relations between sentences. This suggests that our model benefits more from incremental language processing on long sequences.

Conclusions
We propose a coherence model which encodes sentences individually using a pretrained language model and interprets sentences incrementally. The simple baseline, which, unlike humans, encodes a whole text at once, outperforms our model on GCDC, which includes informal texts such as online reviews. However, our model outperforms this baseline on TOEFL, whose texts are better organized. Our findings suggest that, when designing more advanced neural models with a pretrained language model, it could be useful to constrain models to be exposed to limited information, as humans are.

Appendix A. Dataset Details
Table 3 describes statistics on two datasets, GCDC and TOEFL. We split each text into sentences with the Stanford Stanza library and tokenize the sentences with the XLNet tokenizer.
Table 3: Dataset statistics on tokenization: four domains in GCDC, Yahoo (G-Y), Clinton (G-C), Enron (G-E), Yelp (G-P), and each TOEFL prompt (T-P).
Prompt 1 Agree or Disagree: It is better to have broad knowledge of many academic subjects than to specialize in one specific subject.
Prompt 2 Agree or Disagree: Young people enjoy life more than older people do.
Prompt 3 Agree or Disagree: Young people nowadays do not give enough time to helping their communities.
Prompt 4 Agree or Disagree: Most advertisements make products seem much better than they really are.
Prompt 5 Agree or Disagree: In twenty years, there will be fewer cars in use than there are today.
Prompt 6 Agree or Disagree: The best way to travel is in a group led by a tour guide.
Prompt 7 Agree or Disagree: It is more important for students to understand ideas and concepts than it is for them to learn facts.
Prompt 8 Agree or Disagree: Successful people try new things and take risks rather than only doing what they already know how to do well.