A Neural Local Coherence Analysis Model for Clarity Text Scoring

Local coherence relations between two phrases/sentences, such as cause-effect and contrast, strongly influence whether a text is well-structured. Following this assumption, this paper presents a method for scoring text clarity that utilizes local coherence between adjacent sentences. We hypothesize that contextual features of coherence relations learned from data different from the target training data can still discriminate well-structured target texts and thus help to score text clarity. We propose a text clarity scoring method that utilizes local coherence analysis in an out-of-domain setting, i.e., the training data for the source and target tasks differ from each other. Built on the pre-trained language model BERT, the method first trains the local coherence model as an auxiliary task and then re-trains it together with the text clarity scoring model. Experimental results on the PeerRead benchmark dataset show an improvement over a single text clarity scoring model. Our source code is available online.


Introduction
Text clarity scoring can be defined as a task that assesses how well-structured a text is by assigning a grade. It is beneficial not only for authors and reviewers of scholarly papers but also for students who want to improve their writing skills. Among several properties such as spelling, grammar, and word choice, local coherence, which captures text relatedness at the level of sentence-to-sentence transitions (Barzilay and Lapata, 2008), is one of the main properties for identifying whether a text is well-structured. Well-known early attempts at modeling coherence are lexical coherence models (Halliday and Hasan, 1976), psychological models of discourse (Foltz et al., 1998), rhetorical structure theory (Mann and Thompson, 1988), lexical chains (Morris and Hirst, 1991), and the entity grid model (Barzilay and Lapata, 2008). More recently, coherence models based on deep learning techniques have been intensively studied. These attempts include recursive and recurrent neural networks (Li and Hovy, 2014; Li and Jurafsky, 2017), a combination of LSTM and CNN (Mesgar and Strube, 2018), a deep coherence model based on CNN (Cui et al., 2017), SKIPFLOW LSTM (Tay et al., 2018), and a pre-trained generative model (Xu et al., 2019). Such models can encode patterns of semantic change in a text. Despite some successes, the techniques explored so far mainly rely on word sequences within a sentence. Farag and Yannakoudakis (2019) attempted to encode information about the types of grammatical roles in a sentence, such as clausal modifiers of nouns and coordinating conjunctions, obtained by using the Stanford Dependency Parser (Chen and Manning, 2014), for coherence modeling. The authors utilized a hierarchical neural network model trained on two tasks, grammatical roles and coherence, in a multi-task manner.
In this paper, we propose a method for text clarity scoring that utilizes a coherence model in an auxiliary manner. Instead of training grammatical role and coherence tasks on the same data, our method explicitly trains the coherence model on existing coherence relation training data and utilizes it for scoring the text clarity of scholarly papers.

(a) There is a Eurocity train on Platform 1. (b) Its destination is Rome. (c) There is another Eurocity on Platform 2. (d) Its destination is Zürich.

Figure 1: Coherence relation example from Wolf et al. (2003). Sentence (b) elaborates on sentence (a), sentence (a) is parallel with sentence (c), and sentence (b) is in contrast with sentence (d).

Several ideas have been proposed for categorizing coherence relations (Hobbs, 1979; Hovy, 1991; Sanders et al., 1992; Knott and Dale, 1994). We hypothesize that coherence relation knowledge learned from out-of-domain data can also help discriminate well-structured target texts and thus help to score text clarity. Our method, based on the pre-trained language model BERT (Devlin et al., 2019), first trains the local coherence model on BERT sentential encodings as an auxiliary task and then re-trains it together with the text clarity scoring model, adapting the delayed multi-task approach (Shimura et al., 2019). In this way, coherence relations can also help the model learn what makes a text well-structured. The main contributions of our work can be summarized as follows: (1) we propose a method for scoring text clarity based on local coherence relations between two sentences/segments; (2) we introduce a learning framework that first trains the local coherence model as an auxiliary task and then re-trains it together with the text clarity scoring model; (3) we show that coherence relations learned from out-of-domain data help to capture the well-structuredness of the target text and are thus beneficial for scoring text clarity.

Learning Local Coherence with Sentential Encoders
To learn a coherence model, we utilize existing coherence relation training data provided by Discourse Graphbank 1.0 (Wolf and Gibson, 2005). We used eleven coherence relations between adjacent sentences, such as elaboration and parallel, as "has coherence relation" and the none relation between them as "does not have coherence relation". An example passage with coherence relations is shown in Figure 1. Figure 2 illustrates our neural clarity learning framework. The left-hand side of the framework shows the clarity score prediction model, and the right-hand side shows the local coherence model. The contextualized sentence representation we use is BERT, a bidirectional Transformer model (Devlin et al., 2019). A Transformer encoder computes the representation of each token through an attention mechanism with respect to the surrounding tokens. As shown on the right-hand side of Figure 2, the input to BERT is a pair (S_i, S_j) of two adjacent sentences S_i = {w_1, ..., w_m} and S_j = {w_1, ..., w_n} concatenated by a special token [SEP]. Each sentence consists of tokens segmented by the BERT tokenizer using the WordPiece embeddings vocabulary (Wu et al., 2016). The representation of each token is the sum of the corresponding token, segment, and position embeddings. The first token of every input is the special token [CLS], and the final hidden state corresponding to this [CLS] token is regarded as an aggregated representation of the input sentence pair. We use this aggregated representation as our sentential embedding of a sentence pair. Let w ∈ R^{d×(m+n)} be the sentential embedding of a two-sentence pair (S_i, S_j). The sentential embedding is passed to a fully connected layer FC_co, and the softmax function is applied to obtain the probabilities of the two predicted labels, "has coherence relation" and "does not have coherence relation", in the output layer.
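The [CLS]/[SEP] input layout described above can be sketched in pure Python. This is an illustrative sketch of the standard BERT sentence-pair input format (token and segment-id layout only), not the authors' implementation, and the toy tokens are assumptions:

```python
def build_bert_input(tokens_i, tokens_j):
    """Lay out a sentence pair (S_i, S_j) in BERT's input format.

    [CLS] is prepended, [SEP] separates and terminates the pair, and
    segment ids mark which tokens belong to S_i (0) and S_j (1).
    """
    tokens = ["[CLS]"] + tokens_i + ["[SEP]"] + tokens_j + ["[SEP]"]
    segment_ids = [0] * (len(tokens_i) + 2) + [1] * (len(tokens_j) + 1)
    return tokens, segment_ids

# toy WordPiece-style tokens for two adjacent sentences
tokens, segment_ids = build_bert_input(["there", "is", "a", "train"],
                                       ["its", "destination", "is", "rome"])
```

The final hidden state of the leading [CLS] token then serves as the sentential embedding of the pair.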
The network is trained with the objective of minimizing the binary cross-entropy loss between the predicted distributions and the true distributions (one-hot vectors).
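A minimal sketch of this classification head: a fully connected layer (standing in for FC_co) followed by softmax and the cross-entropy loss against a one-hot target. The 3-dimensional embedding and the weights below are toy stand-ins for BERT's [CLS] vector and learned parameters:

```python
import math

def fully_connected(x, W, b):
    # logits = W x + b, one row of W per output label
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, onehot):
    # loss between the predicted distribution and the one-hot target
    return -sum(t * math.log(p) for p, t in zip(probs, onehot) if t > 0)

# toy sentential embedding of a sentence pair
w = [0.2, -0.5, 0.7]
W = [[0.1, 0.4, -0.2],   # row for "has coherence relation"
     [-0.3, 0.2, 0.5]]   # row for "does not have coherence relation"
b = [0.0, 0.1]
probs = softmax(fully_connected(w, W, b))
loss = cross_entropy(probs, [1.0, 0.0])  # gold: "has coherence relation"
```

During training, the loss is minimized over all sentence pairs in the Discourse Graphbank training split.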

Learning Clarity Score with Paragraph Encoders
Like the coherence model, which takes a pair of two sentences as input and coherence labels as output, the clarity scoring model accepts a paragraph and its clarity score as a training instance, as shown on the left-hand side of Figure 2.

Figure 2: Neural Clarity Learning Framework

BERT encodes the paragraph, and the encoding is then passed through a fully connected layer, FC_cs, with dropout (Hinton et al., 2012) applied; the dropout randomly sets values in the layer to 0. Finally, we obtain the score from linear regression. The network is trained with the objective of minimizing the mean squared error (MSE).
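The scoring pipeline just described (dropout over the paragraph encoding, a fully connected projection, and a linear regression output trained with MSE) can be sketched as follows; the toy dimensions and the inverted-dropout rescaling are illustrative assumptions, not details from the paper:

```python
import random

def dropout(x, p, rng):
    # randomly set values to 0 with probability p; survivors are rescaled
    # by 1/(1-p) ("inverted dropout") so the expected activation is unchanged
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in x]

def predict_score(paragraph_enc, W, b, p=0.1, rng=None):
    rng = rng or random.Random(0)
    h = dropout(paragraph_enc, p, rng)
    # linear regression over the (dropped-out) fully connected features
    return sum(wi * hi for wi, hi in zip(W, h)) + b

def mse(predicted, gold):
    # training objective: mean squared error against gold clarity scores
    return sum((p - g) ** 2 for p, g in zip(predicted, gold)) / len(predicted)
```

At test time dropout is disabled (p = 0), so the layer passes the encoding through unchanged.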

Clarity Score Prediction
We hypothesize that coherence relation information learned from out-of-domain data also helps assess clarity. For the sentence-pair encodings obtained from the coherence model, we apply a max-pooling operation over the sequence of sentence pairs. The features from this pooling layer are concatenated with the paragraph encoding obtained from the score prediction model and then passed to the fully connected layer FC_cs. The parameters of the two models are trained to minimize the MSE.
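A sketch of this combination step: element-wise max-pooling over the sentence-pair encodings of a paragraph, concatenated with the paragraph encoding. The vectors and dimensions below are toy assumptions:

```python
def max_pool(pair_encodings):
    # element-wise max over the sequence of sentence-pair encodings
    return [max(column) for column in zip(*pair_encodings)]

# hypothetical coherence-model encodings for three adjacent sentence pairs
pair_encodings = [[0.1, 0.9, -0.2],
                  [0.4, 0.3,  0.8],
                  [0.7, 0.1,  0.0]]
pooled = max_pool(pair_encodings)

# hypothetical paragraph encoding from the score prediction model
paragraph_enc = [0.5, -0.1]

# concatenated features, passed on to the fully connected layer FC_cs
combined = pooled + paragraph_enc
```

Max-pooling makes the coherence features length-invariant, so paragraphs with different numbers of sentence pairs yield a fixed-size vector for FC_cs.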

Experimental Setup
We chose the Discourse Graphbank dataset to learn the coherence model. We obtained 8,101 pairs of two adjacent segments from 135 annotated texts; 49% of them have a coherence relation, and 51% do not. The coherence relations we used consist of elaboration (1,831), attribution (662), parallel (507), cause-effect (340), contrast (189), temporal sequence (135), condition (108), others (183), and the none relation (4,146), where the number in parentheses is the count of each relation in our data. "None" denotes no coherence relation. We divided these pairs into training, validation, and test data.
The PeerRead dataset (Kang et al., 2018) was selected as a benchmark. We chose the ACL 2017 and ICLR 2017 subsets, which have clarity scores, and merged them. We used the Introduction sections in our experiments. As a result, we obtained 212 papers and 1,081 paragraphs, each of which has more than one sentence. We also divided these paragraphs into three folds. Tables 1 and 2 summarize the two datasets. The coherence model adopted the BERT base model as pre-training and was then fine-tuned on the Discourse Graphbank dataset. The model takes a segment pair as input and learns to predict whether it has a coherence relation or not. We tuned the model hyperparameters, batch size, and learning rate (Adam). We obtained a model with 86% accuracy. When we trained the combined model, the model hyperparameters were fixed, i.e., batch size 8 and learning rate 3e-5. Likewise, these values were fixed for the other baselines, except for the models from PeerRead, for which we followed the original settings. We performed the test with ten random initializations of the weight matrices and data splits.

Results
We compared our model with the following baselines: (i) three models from PeerRead (Kang et al., 2018); (ii) the same three models with BERT base as word embeddings; (iii) Single, a text clarity scoring model based on BERT base but without coherence identification; (iv) Combined-no-DG, a method that utilizes sentence-pair encodings obtained by BERT but does not train a model on the Discourse Graphbank dataset; and (v) Combined, our method, which utilizes the coherence model pre-trained on the Discourse Graphbank dataset and then fine-tunes it along with the trained single model. The results are shown in Table 3.

Figure 3: Performance against the percentage of the training data. We ran each training-data size ten times, except for 100%, and obtained the average MSE.
We can see from Table 3 that the result obtained by "Combined" is the best among "Combined", "Combined-no-DG", and "Single". Moreover, it is better than the three models from PeerRead and the same models with BERT base. The fact that "Combined-no-DG" performs worse than "Single" indicates that paragraph encoding is advantageous for predicting the text clarity score, but BERT sentence-pair encoding without coherence relation knowledge is not.
We also examined how the percentage of training data affects overall performance. Figure 3 shows the MSE against the percentage of the PeerRead training data for "Combined" and "BERT-DAN", the best and second-best models. Overall, the curves show that more training data helps performance, while the curve obtained by our "Combined" model increases more slowly than that of BERT-DAN, indicating that local coherence relations between two adjacent phrases/sentences help to learn a better model for scoring text clarity.

Conclusion
We proposed a method for scoring text clarity that utilizes local coherence between adjacent sentences. The experimental results on the PeerRead benchmark dataset show an improvement over a single text clarity scoring model. Future work will include (i) incorporating a global coherence model into our framework to improve the overall performance of text clarity scoring, (ii) extending our model to capture other criteria such as word choice and grammar for modeling clarity, and (iii) applying our model to the essay scoring task.