A Neural Local Coherence Model

We propose a local coherence model based on a convolutional neural network that operates over the entity grid representation of a text. The model captures long-range entity transitions along with entity-specific features without losing generalization, thanks to the power of distributed representations. We present a pairwise ranking method to train the model in an end-to-end fashion on a task and learn task-specific high-level features. Our evaluation on three different coherence assessment tasks demonstrates that our model achieves state-of-the-art results, outperforming existing models by a good margin.


Introduction and Motivation
What distinguishes a coherent text from a random sequence of sentences is that it binds the sentences together to express a meaning as a whole: the interpretation of a sentence usually depends on the meaning of its neighbors. Coherence models that can distinguish a coherent text from incoherent ones have a wide range of applications in text generation, summarization, and coherence scoring.
Several formal theories of coherence have been proposed (Mann and Thompson, 1988a; Grosz et al., 1995; Asher and Lascarides, 2003), and their principles have inspired the development of many existing coherence models (Barzilay and Lapata, 2008; Lin et al., 2011). Among these models, the entity grid (Barzilay and Lapata, 2008), which is based on Centering Theory (Grosz et al., 1995), is arguably the most popular, and has seen a number of improvements over the years. As shown in Figure 1, the entity grid model represents a text by a grid that captures how grammatical roles of different entities change from sentence to sentence. The grid is then converted into a feature vector containing probabilities of local entity transitions, which enables machine learning models to learn the degree of text coherence. Extensions of this basic grid model incorporate entity-specific features (Elsner and Charniak, 2011), multiple ranks (Feng and Hirst, 2012), and coherence relations (Feng et al., 2014).
* Both authors contributed equally to this work.
While the entity grid and its extensions have been successful in many applications, they are limited in several ways. First, they use a discrete representation for grammatical roles and features, which prevents the model from considering sufficiently long transitions (Bengio et al., 2003). Second, feature vector computation in existing models is decoupled from the target task, which limits the model's capacity to learn task-specific features.
In this paper, we propose a neural architecture for coherence assessment that can capture long-range entity transitions along with arbitrary entity-specific features. Our model obtains generalization through distributed representations of entity transitions and entity features. We also present an end-to-end training method to learn task-specific high-level features automatically in our model.
We evaluate our approach on three different evaluation tasks: discrimination, insertion, and summary coherence rating, proposed previously for evaluating coherence models (Barzilay and Lapata, 2008;Elsner and Charniak, 2011). Discrimination and insertion involve identifying the right order of the sentences in a text with different levels of difficulty. In the summary coherence rating task, we compare the rankings, given by the model, against human judgments of coherence.
The experimental results show that our neural models consistently improve over their non-neural counterparts (i.e., existing entity grid models), yielding absolute gains of about 4% on discrimination, up to 2.5% on insertion, and more than 4% on summary coherence rating. Furthermore, our model achieves state-of-the-art results in all these tasks. We have released our source code for research purposes.
The remainder of this paper is organized as follows. We describe the entity grid, its extensions, and its limitations in Section 2. In Section 3, we present our neural model. We describe evaluation tasks and results in Sections 4 and 5. We give a brief account of related work in Section 6. Finally, we conclude with future directions in Section 7.

Entity Grid and Its Extensions
Motivated by Centering Theory (Grosz et al., 1995), Barzilay and Lapata (2008) proposed an entity-based model for representing and assessing text coherence. Their model represents a text by a two-dimensional array called an entity grid that captures transitions of discourse entities across sentences. As shown in Figure 1, the rows of the grid correspond to sentences, and the columns correspond to discourse entities appearing in the text. They consider noun phrases (NPs) as entities, and employ a coreference resolver to detect mentions of the same entity (e.g., Obama, the president). Each entry $G_{i,j}$ in the entity grid represents the syntactic role that entity $e_j$ plays in sentence $s_i$, which can be one of: subject (S), object (O), or other (X). In addition, entities not appearing in a sentence are marked by a special symbol (−). If an entity appears more than once with different grammatical roles in the same sentence, the role with the highest rank ($S \succ O \succ X$) is considered.
To represent the entity grid using a feature vector, Barzilay and Lapata (2008) compute the probability of each local entity transition of length $k$ (i.e., $\{S, O, X, -\}^k$), and represent each grid by a vector of $4^k$ transition probabilities. To distinguish transitions of important entities from those of unimportant ones, they consider the salience of the entities, which they quantify by their occurrence frequency in the document. Assessment of text coherence is then formulated as a ranking problem in an SVM preference ranking framework (Joachims, 2002).
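The feature computation described above can be sketched as follows. This is a minimal illustration, not the original implementation: the function name `transition_probabilities`, the list-of-lists grid layout, and the toy grid are assumptions made for the example, and salience is ignored.

```python
from collections import Counter
from itertools import product

ROLES = ("S", "O", "X", "-")

def transition_probabilities(grid, k=2):
    """Return a dict mapping each length-k role transition to its probability.

    grid: list of rows (one per sentence); each row lists one role per entity.
    """
    n_sent = len(grid)
    n_ent = len(grid[0]) if grid else 0
    counts = Counter()
    for j in range(n_ent):                      # one column per entity
        column = [grid[i][j] for i in range(n_sent)]
        for t in range(n_sent - k + 1):         # every window of k consecutive sentences
            counts[tuple(column[t:t + k])] += 1
    total = sum(counts.values())
    # Fixed feature order over all 4**k possible transitions
    return {tr: (counts[tr] / total if total else 0.0)
            for tr in product(ROLES, repeat=k)}

# Hypothetical toy grid: 3 sentences x 2 entities
grid = [["S", "-"],
        ["O", "X"],
        ["-", "X"]]
probs = transition_probabilities(grid, k=2)
```

With `k=2` the vector has 16 entries; in the toy grid only four transitions are observed, so each of them gets probability 0.25 and the remaining twelve get zero.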
Subsequent studies proposed to extend the basic entity grid model. Filippova and Strube (2007) attempted to improve the model by grouping entities based on semantic relatedness, but did not obtain a significant improvement. Elsner and Charniak (2011) proposed a number of improvements. They initially show significant improvement by including non-head nouns (i.e., nouns that do not head NPs) as entities in the grid. Then, they extend the grid to distinguish between entities of different types by incorporating entity-specific features like named entity, noun class, modifiers, etc. These extensions led to the best results reported so far.
The entity grid and its extensions have been successfully applied to many downstream tasks including coherence rating (Barzilay and Lapata, 2008), essay scoring (Burstein et al., 2010), story generation (McIntyre and Lapata, 2010), and readability assessment (Pitler et al., 2010; Barzilay and Lapata, 2008). They have also been critical components in state-of-the-art sentence ordering models (Soricut and Marcu, 2006; Elsner and Charniak, 2011; Lin et al., 2011).
Example document (cf. Figure 1):
s0: Eaton Corp. said it sold its Pacific Sierra Research unit to a company formed by employees of that unit.
s1: Terms were not disclosed.
s2: Pacific Sierra, based in Los Angeles, has 200 employees and supplies professional services and advanced products to industry.
s3: Eaton is an automotive parts, controls and aerospace electronics concern.

Limitations of Entity Grid Models
Despite their success, existing entity grid models are limited in several ways.
• Existing models use a discrete representation for grammatical roles and features, which leads to the so-called curse of dimensionality problem (Bengio et al., 2003). In particular, to model transitions of length $k$ with $R$ different grammatical roles, the basic entity grid model needs to compute $R^k$ transition probabilities from a grid. The estimated distribution therefore becomes sparse as $k$ increases. This prevents the model from considering longer transitions; existing models use $k \le 3$. This problem is exacerbated when we want to include entity-specific features, as the number of parameters grows exponentially with the number of features (Elsner and Charniak, 2011).
• Existing models compute feature representations from entity grids in a task-agnostic way. In other words, feature extraction is decoupled from the target downstream tasks. This can limit the model's capacity to learn task-specific features. Therefore, models that can be trained in an end-to-end fashion on different target tasks are desirable.
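The sparsity argument in the first limitation can be made concrete with a small count. The sketch below compares the number of possible transition types with the number of length-k windows a single grid actually provides; the document size (10 sentences, 20 entities) and the helper names are hypothetical, chosen only for illustration.

```python
def num_transition_types(num_roles, k):
    """Number of distinct role transitions of length k (R ** k)."""
    return num_roles ** k

def num_observed_windows(m, n, k):
    """Number of length-k windows in a grid with m sentences and n entities."""
    return n * max(m - k + 1, 0)

# Hypothetical document: 10 sentences, 20 entities, 4 roles {S, O, X, -}
table = {k: (num_transition_types(4, k), num_observed_windows(10, 20, k))
         for k in (2, 3, 5, 8)}
```

For this toy document, k = 3 gives 64 transition types estimated from 160 windows, whereas k = 8 gives 65,536 types estimated from only 60 windows, so almost all probabilities would be zero.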
In the following section, we present a neural architecture that allows us to capture long-range entity transitions along with arbitrary entity-specific features without losing generalization. We also present an end-to-end training method to learn task-specific features automatically.

The Neural Coherence Model
Figure 2 summarizes our neural architecture for modeling local coherence, and how it can be trained in a pairwise fashion. The architecture takes a document as input, and first extracts its entity grid. The first layer of the neural network transforms each grammatical role in the grid into a distributed representation, a real-valued vector. The second layer computes high-level features by going over each column (transitions) of the grid. The following layer selects the most important high-level features, which are in turn used for coherence scoring. The features computed at different layers of the network are automatically trained by backpropagation to be relevant to the task. In the following, we elaborate on the layers of the neural network model.
(I) Transforming grammatical roles into feature vectors: Grammatical roles are fed to our model as indices taken from a finite vocabulary $V$. In the simplest scenario, $V$ contains $\{S, O, X, -\}$. However, we will see in Section 3.1 that as we include more entity-specific features, $V$ can contain more symbols. The first layer of our network maps each of these indices into a distributed representation in $\mathbb{R}^d$ by looking up a shared embedding matrix $E \in \mathbb{R}^{|V| \times d}$. We consider $E$ a model parameter to be learned by backpropagation on a given task. We can initialize $E$ randomly or using pretrained vectors trained on a general coherence task.
Given an entity grid $G$ with columns representing entity transitions over sentences in a document, the lookup layer extracts a $d$-dimensional vector for each entry $G_{i,j}$ from $E$. More formally,
$$L(G) = \left[ E(G_{1,1}) \cdots E(G_{i,j}) \cdots E(G_{m,n}) \right] \tag{1}$$
where $E(G_{i,j})$ refers to the row in $E$ that corresponds to the grammatical role $G_{i,j} \in V$; $m$ is the total number of sentences and $n$ is the total number of entities in the document. The output $L(G)$ is a tensor in $\mathbb{R}^{m \times n \times d}$, which is fed to the next layer of the network as we describe below.
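As a rough illustration of the lookup layer, the following NumPy sketch maps a toy grid to a tensor of shape (m, n, d). The vocabulary ordering, the embedding size d = 5, and the toy grid are assumptions for the example; the random initialization stands in for a learned embedding matrix.

```python
import numpy as np

# Assumed index assignment for the basic vocabulary {S, O, X, -}
VOCAB = {"S": 0, "O": 1, "X": 2, "-": 3}

def lookup(grid, E):
    """Map an m x n grid of role symbols to an (m, n, d) tensor of embeddings."""
    idx = np.array([[VOCAB[role] for role in row] for row in grid])  # (m, n)
    return E[idx]                                                    # (m, n, d)

rng = np.random.default_rng(0)
d = 5
E = rng.uniform(-0.01, 0.01, size=(len(VOCAB), d))   # untrained stand-in for E

grid = [["S", "-"],
        ["O", "X"],
        ["-", "X"]]
L = lookup(grid, E)
```

Because the lookup is plain integer indexing, every grid entry with the same role shares one row of `E`, which is how the embedding parameters are shared across all documents.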
(II) Modeling entity transitions: The vectors produced by the lookup layer are combined by subsequent layers of the network to generate a coherence score for the document. To compose higher-level features from the embedding vectors, we make the following modeling assumptions:
• Similar to existing entity grid models, we assume there is no spatio-temporal relation between the entities in a document. In other words, columns in a grid are treated independently.
• We are interested in modeling entity transitions of arbitrary lengths in a location-invariant way. This means, we aim to compose local patches of entity transitions into higher-level representations, while treating the patches independently of their position in the entity grid.
Under these assumptions, the natural choice to tackle this problem is to use a convolutional approach, used previously to solve other NLP tasks (Collobert et al., 2011;Kim, 2014).
Convolution layer: A convolution operation involves applying a filter $\mathbf{w} \in \mathbb{R}^{k \cdot d}$ (i.e., a vector of weight parameters) to each entity transition of length $k$ to produce a new abstract feature
$$h_t = f(\mathbf{w}^T L_{t:t+k-1,j} + b_t) \tag{2}$$
where $L_{t:t+k-1,j}$ denotes the concatenation of $k$ vectors in the lookup layer representing a transition of length $k$ for entity $e_j$ in the grid, $b_t$ is a bias term, and $f$ is a nonlinear activation function, e.g., ReLU (Nair and Hinton, 2010) in our model. We apply this filter to all possible $k$-length transitions of different entities in the grid to generate a feature map, $\mathbf{h}^i = [h_1, \cdots, h_{m \cdot n + k - 1}]$. We repeat this process $N$ times with $N$ different filters to get $N$ different feature maps (Figure 2). Notice that we use a wide convolution (Kalchbrenner et al., 2014), as opposed to a narrow one, to ensure that the filters reach entire columns of a grid, including the boundary entities. This is done by performing zero-padding, where out-of-range (i.e., for $t < 0$ or $t > \{m, n\}$) vectors are assumed to be zero.
Convolutional filters learn to compose local transition features of a grid into higher-level representations automatically. Since they operate over the distributed representation of grid entries, the transition length $k$ can be sufficiently large compared to traditional grid models (e.g., 5 to 8 in our experiments) to capture long-range transitional dependencies without overfitting on the training data. Moreover, unlike existing grid models that compute transition probabilities from a single document, embedding vectors and convolutional filters are learned from all training documents, which helps the neural framework to obtain better generalization and robustness.
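A wide convolution over entity columns can be sketched in NumPy as below. Several choices here are assumptions of the sketch rather than details confirmed by the text: each column is zero-padded separately (so no window crosses from one entity to the next, giving n(m+k-1) features per map instead of the m·n+k−1 quoted above), one bias is used per filter, and the function name `wide_conv_columns` is hypothetical.

```python
import numpy as np

def wide_conv_columns(L, W, b):
    """Wide 1-D convolution along each entity column, followed by ReLU.

    L: (m, n, d) lookup output; W: (N, k*d) filters; b: (N,) per-filter biases.
    Returns feature maps of shape (N, n * (m + k - 1)).
    """
    m, n, d = L.shape
    N, kd = W.shape
    k = kd // d
    pad = np.zeros((k - 1, n, d))                 # zero vectors outside the grid
    Lp = np.concatenate([pad, L, pad], axis=0)    # (m + 2k - 2, n, d)
    features = []
    for j in range(n):                            # columns are treated independently
        for t in range(m + k - 1):                # every (partly padded) window
            window = Lp[t:t + k, j, :].reshape(-1)            # concatenation of k vectors
            features.append(np.maximum(0.0, W @ window + b))  # ReLU activation
    return np.stack(features, axis=1)

rng = np.random.default_rng(0)
m, n, d, k, N = 3, 2, 5, 2, 4                     # toy sizes, chosen arbitrarily
L = rng.normal(size=(m, n, d))
W = rng.normal(size=(N, k * d))
b = np.zeros(N)
H = wide_conv_columns(L, W, b)
```

The explicit loops keep the correspondence with Equation 2 visible; a practical implementation would use a vectorized or framework-provided convolution instead.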
Pooling layer: After the convolution, we apply a max-pooling operation to each feature map:
$$\mathbf{m} = [\mu_p(\mathbf{h}^1), \cdots, \mu_p(\mathbf{h}^N)] \tag{3}$$
where $\mu_p(\mathbf{h}^i)$ refers to the max operation applied to each non-overlapping window of $p$ features in the feature map $\mathbf{h}^i$. Max-pooling reduces the output dimensionality by a factor of $p$, and it drives the model to capture the most salient local features from each feature map in the convolutional layer.
Coherence scoring: Finally, the max-pooled features are used in the output layer of the network to produce a coherence score $y \in \mathbb{R}$:
$$y = \mathbf{v}^T \mathbf{m} + b \tag{4}$$
where $\mathbf{v}$ is the weight vector and $b$ is a bias term.
Why it works: Intuitively, each filter detects a specific transition pattern (e.g., 'SS-O-X' for a coherent text), and if this pattern occurs somewhere in the grid, the resulting feature map will have a large value for that particular region and small values for other regions. By applying max pooling on this feature map, the network then discovers that the transition appeared in the grid.
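The pooling and scoring steps can be sketched as follows. This is an illustrative NumPy version under stated assumptions: an incomplete trailing window is simply dropped, and the names `max_pool` and `coherence_score` as well as the toy numbers are made up for the example.

```python
import numpy as np

def max_pool(h, p):
    """Max over non-overlapping windows of p features; h has shape (N, T)."""
    N, T = h.shape
    T_trim = (T // p) * p                  # assumption: drop an incomplete last window
    return h[:, :T_trim].reshape(N, T // p, p).max(axis=2)

def coherence_score(h, v, b, p):
    """Linear scoring of the flattened max-pooled features."""
    pooled = max_pool(h, p).reshape(-1)
    return float(v @ pooled + b)

# Toy feature map from a single filter (N = 1, T = 7)
h = np.array([[1.0, 5.0, 2.0, 0.0, 3.0, 9.0, 7.0]])
pooled = max_pool(h, p=3)
score = coherence_score(h, v=np.array([1.0, 2.0]), b=0.5, p=3)
```

Here the two complete windows `[1, 5, 2]` and `[0, 3, 9]` are reduced to 5 and 9, so a strong filter response anywhere inside a window survives pooling regardless of its exact position.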

Incorporating Entity-Specific Features
Our model as described above neuralizes the basic entity grid model, which considers only entity transitions without distinguishing between types of entities. However, as Elsner and Charniak (2011) pointed out, entity-specific features could be crucial for modeling local coherence. One simple way to incorporate entity-specific features into our model is to attach the feature value (e.g., named entity type) to the grammatical role in the grid.
For example, if an entity $e_j$ of type PERSON appears as a subject (S) in sentence $s_i$, the grid entry $G_{i,j}$ can be encoded as PERSON-S.
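One way to realize this encoding is sketched below. The helper names `encode_entry` and `build_vocab` are hypothetical, and leaving absent entities as a bare "-" is an assumption of this sketch, not something the text specifies.

```python
def encode_entry(role, features=()):
    """Attach feature values to a grammatical role, e.g. ('PERSON',) + 'S' -> 'PERSON-S'."""
    if role == "-":
        return "-"                      # assumption: absent entities carry no features
    return "-".join(list(features) + [role])

def build_vocab(symbol_grid):
    """Assign an integer index to every distinct symbol in an encoded grid."""
    vocab = {}
    for row in symbol_grid:
        for sym in row:
            vocab.setdefault(sym, len(vocab))
    return vocab

# Toy grid: entity 0 is a PERSON, entity 1 an ORGANIZATION (hypothetical labels)
roles = [["S", "-"],
         ["O", "X"]]
types = [("PERSON",), ("ORGANIZATION",)]
encoded = [[encode_entry(r, types[j]) for j, r in enumerate(row)] for row in roles]
vocab = build_vocab(encoded)
```

The enlarged vocabulary simply adds rows to the embedding matrix, so the rest of the network is unchanged.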

Training
Our neural model assigns a coherence score to an input document $d$ based on the degree of local coherence observed in its entity grid $G$. Let $y = \phi(G|\theta)$ define our model that transforms an input grid $G$ to a coherence score $y$ through a sequence of lookup, convolutional, pooling, and linear projection layers with parameter set $\theta$. The parameter set $\theta$ includes the embedding matrix $E$, the filter matrix $W$, the weight vector $\mathbf{v}$, and the biases. We use a pairwise ranking approach (Collobert et al., 2011) to learn $\theta$. The training set comprises ordered pairs $(d_i, d_j)$, where document $d_i$ exhibits a higher degree of coherence than document $d_j$. As we will see in Section 4, such orderings can be obtained automatically or through manual annotation. In training, we seek to find $\theta$ that assigns a higher coherence score to $d_i$ than to $d_j$. We minimize the following ranking objective with respect to $\theta$:
$$\mathcal{J}(\theta) = \max\{0, 1 - \phi(G_i|\theta) + \phi(G_j|\theta)\} \tag{5}$$
where $G_i$ and $G_j$ are the entity grids corresponding to documents $d_i$ and $d_j$, respectively. Notice that (also shown in Figure 2) the network shares its layers (and hence $\theta$) to obtain $\phi(G_i|\theta)$ and $\phi(G_j|\theta)$ from a pair of input grids $(G_i, G_j)$.
Barzilay and Lapata (2008) adopted a similar ranking criterion using an SVM preference kernel learner as they argue coherence assessment is best seen as a ranking problem as opposed to classification (coherent vs. incoherent). Also, the ranker gives a scoring function φ that a text generation system can use to compare alternative hypotheses.
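The pairwise ranking idea can be illustrated with a hinge-style loss on two scores. This is a minimal sketch assuming a margin of 1, a common choice for this kind of objective; the function name and the example scores are hypothetical.

```python
def ranking_loss(score_pos, score_neg, margin=1.0):
    """Hinge loss that is zero once the more coherent document wins by `margin`."""
    return max(0.0, margin - score_pos + score_neg)

# The coherent document already outscores the incoherent one by more than the margin
loss_satisfied = ranking_loss(score_pos=2.0, score_neg=0.5)
# The incoherent document is scored higher, so the pair contributes a positive loss
loss_violated = ranking_loss(score_pos=0.2, score_neg=0.5)
```

Only pairs that violate the margin produce a gradient, and both scores come from the same network, so a single update pushes the score of the coherent grid up and that of the incoherent grid down.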

Evaluation Tasks
We evaluate the effectiveness of our coherence models on two different evaluation tasks: sentence ordering and summary coherence rating.

Sentence Ordering
Following Elsner and Charniak (2011), we evaluate our models on two sentence ordering tasks: discrimination and insertion.
In the discrimination task (Barzilay and Lapata, 2008), a document is compared to a random permutation of its sentences, and the model is considered correct if it scores the original document higher than the permuted one. We use 20 permutations of each document in the test set in accordance with previous work. In the insertion task (Elsner and Charniak, 2011), we evaluate models based on their ability to locate the original position of a sentence previously removed from a document. To measure this, each sentence in the document is removed in turn, and an insertion place is located for which the model gives the highest coherence score to the document. The insertion score is then computed as the average fraction of sentences per document reinserted in their actual position.
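Both evaluation protocols can be sketched in a few lines. In the sketch below, `score` is any function from a list of sentences to a number; the toy scorer `count_ascending` (which rewards sentences appearing in increasing order) is a hypothetical stand-in for a trained coherence model, and ties in the insertion task are resolved in favor of the earliest position, which is an assumption.

```python
import random

def discrimination_accuracy(docs, score, num_perms=20, seed=0):
    """Fraction of (original, permuted) pairs where the original scores higher."""
    rng = random.Random(seed)
    correct = total = 0
    for sents in docs:
        for _ in range(num_perms):
            perm = sents[:]
            rng.shuffle(perm)
            if perm == sents:
                continue                      # skip permutations identical to the original
            total += 1
            correct += score(sents) > score(perm)
    return correct / total if total else 0.0

def insertion_score(sents, score):
    """Fraction of sentences whose best-scoring reinsertion slot is the original one."""
    hits = 0
    for i, s in enumerate(sents):
        rest = sents[:i] + sents[i + 1:]
        candidates = [rest[:pos] + [s] + rest[pos:] for pos in range(len(sents))]
        best = max(range(len(sents)), key=lambda pos: score(candidates[pos]))
        hits += (best == i)
    return hits / len(sents)

def count_ascending(sents):
    """Toy scorer: number of adjacent pairs in increasing order."""
    return sum(a < b for a, b in zip(sents, sents[1:]))

doc = [1, 2, 3, 4]                            # sentences represented by their positions
acc = discrimination_accuracy([doc], count_ascending)
ins = insertion_score(doc, count_ascending)
```

The insertion loop scores one candidate document per slot for every removed sentence, which makes the task both more expensive and, because candidates differ by a single sentence, harder than discrimination.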
Discrimination can be easier for longer documents, since a random permutation is likely to be different from the original one. Insertion is a much more difficult task since the candidate documents differ only by the position of one sentence.
Table 1 gives basic statistics about the dataset. Following previous work, we use 20 random permutations of each article, and we exclude permutations that match the original document. The fourth column (# Pairs) in Table 1 shows the resulting number of (original, permuted) pairs used for training our model and for testing in the discrimination task.
Some previous studies (Barzilay and Lapata, 2008) used the AIRPLANES and the EARTHQUAKES corpora, which contain reports on airplane crashes and earthquakes, respectively. Each of these corpora contains 100 articles for training and 100 articles for testing. The average number of sentences per article in these two corpora is 10.4 and 11.5, respectively. We preferred the WSJ corpus for several reasons. First and most importantly, the WSJ corpus is larger than other corpora (see Table 1). A large training set is crucial for learning effective deep learning models (Collobert et al., 2011), and a large enough test set is necessary to make a general comment about model performance. Second, as Elsner and Charniak (2011) pointed out, texts in AIRPLANES and EARTHQUAKES are constrained in style, whereas WSJ documents are more like normal informative articles. Third, we could reproduce results on this dataset for the competing systems (e.g., entity grid and its extensions) using the publicly available Brown coherence toolkit. 6

Summary Coherence Rating
We further evaluate our models on the summary coherence rating task proposed by Barzilay and Lapata (2008), where we compare rankings given by a model to a pair of summaries against rankings elicited from human judges.
Dataset: The summary dataset was extracted from the Document Understanding Conference (DUC'03), which contains 6 clusters of multi-document summaries produced by human experts and 5 automatic summarization systems. Each cluster has 16 summaries of a document with pairwise coherence rankings given by human judges; see Barzilay and Lapata (2008) for details on the annotation method. There are 144 pairs of summaries for training and 80 pairs for testing.

Experiments
In this section, we present our experiments: the models we compare, their settings, and the results.

Models Compared
We compare our coherence model against a random baseline and several existing models.

Random: The Random baseline makes a random decision for the evaluation tasks.
Graph-based Model: This is the graph-based unsupervised model proposed by Guinaudeau and Strube (2013). We use the implementation from the cohere 7 toolkit (Smith et al., 2016), and run it on the test set with syntactic projection (command line option 'projection=3') for graph construction. This setting yielded the best scores for this model.
Distributed Sentence Model: This neural model for measuring text coherence first encodes each sentence in a document into a fixed-length vector using a recurrent or a recursive neural network. Then it computes the coherence score of the document by aggregating the scores estimated for each window of three sentences in the document. We used the implementation made publicly available by the authors. We trained the model on our WSJ corpus with minibatch sizes of 512, 1024, and 1536 for a maximum of 25 epochs. The model that used a minibatch size of 512 and completed 23 epochs achieved the best accuracy on the DEV set. We applied this model to get the scores on the TEST set.
6 https://bitbucket.org/melsner/browncoherence
7 https://github.com/karins/CoherenceFramework
Grid-all nouns (E&C): This is the simple extension of the original entity grid model, where all nouns are considered as entities. Elsner and Charniak (2011) report significant gains by considering all nouns as opposed to only head-nouns. Results for this model were obtained by training the baseline entity grid model (command line option '-n') in the Brown coherence toolkit on our dataset.

Extended grid (E&C): This represents the extended entity grid model of Elsner and Charniak (2011) that uses 9 entity-specific features; 4 of them were computed from external corpora. This model considers all nouns as entities. For this system, we train the extended grid model (command line option '-f') in the Brown coherence toolkit.
Grid-CNN: This is our proposed neural extension of the basic entity grid (all nouns), where we only consider entity transitions as input.
Extended Grid-CNN: This corresponds to our neural model that incorporates entity-specific features following the method described in Section 3.1. To keep the model simple, we include only three entity-specific features from Elsner and Charniak (2011) that are easy to compute and do not require any external corpus. The features are: (i) named entity type, (ii) salience as determined by occurrence frequency of the entity, and (iii) whether the entity has a proper mention.

Settings for Neural Models
We held out 10% of the training documents to form a development set (DEV) on which we tune the hyper-parameters of our neural models. For discrimination and insertion tasks, the resulting DEV set contains 138 articles and 2,678 pairs after removing the permutations that match the original documents. For the summary rating task, DEV contains 14 pairs of summaries. We implement our models in Theano (Theano Development Team, 2016). We use rectified linear units (ReLU) as activations ($f$). The embedding matrix is initialized with samples from the uniform distribution $U(-0.01, 0.01)$, and the weight matrices are initialized with samples from the Glorot uniform distribution (Glorot and Bengio, 2010).
We train the models by optimizing the pairwise ranking loss in Equation 5 using the gradient-based online learning algorithm RMSprop with parameters ($\rho$ and $\epsilon$) set to the values suggested by Tieleman and Hinton (2012). We use up to 25 epochs. To avoid overfitting, we use dropout (Srivastava et al., 2014) of hidden units, and do early stopping by observing accuracy on the DEV set: if the accuracy does not increase for 10 consecutive epochs, we exit with the best model recorded so far. We search for the optimal minibatch size in {16, 32, 64, 128}, embedding size in {80, 100, 200}, dropout rate in {0.2, 0.3, 0.5}, filter number in {100, 150, 200, 300}, window size in {2, 3, 4, 5, 6, 7, 8}, and pooling length in {3, 4, 5, 6, 7}. Table 2 shows the optimal hyperparameter setting for our models. The best model on DEV is then used for the final evaluation on the TEST set. We run each experiment five times, each time with a different random seed, and we report the average of the runs to reduce the effect of randomness on the results. Statistical significance tests are done using an approximate randomization test based on the accuracy. We used SIGF V.2 (Padó, 2006) with 10,000 iterations.

Results on Sentence Ordering
Table 3 shows the results on discrimination and insertion tasks. The graph-based model gets the lowest scores. This is not surprising considering that this model works in an unsupervised way. The distributed sentence model surprisingly performed poorly on our dataset. Among the existing models, the grid models get the best scores on both tasks. This demonstrates that entity transition, as a method to capture local coherence, is more effective than the sentence representation method.
Table 3: Coherence evaluation results on Discrimination and Insertion tasks. † indicates a neural model is significantly superior to its non-neural counterpart with p-value < 0.01.
Neuralization of the existing grid models yields significant improvements in most cases. The Grid-CNN model delivers absolute improvements of about 4% in discrimination and 1% in insertion over the basic grid model. When we compare our Extended Grid-CNN with its non-neural counterpart Extended Grid, we observe similar gains in discrimination and larger gains (2.5%) in insertion. Note that the Extended Grid-CNN yields these improvements considering only a subset of the Extended Grid features. This demonstrates the effectiveness of distributed representations and the convolutional feature learning method. Compared to the discrimination task, the gain in the insertion task is less pronounced. There could be two reasons for this. First, as mentioned before, insertion is a harder task than discrimination. Second, our models were not trained specifically on the insertion task. A model that is trained to distinguish an original document from its random permutation may learn features that are not specific enough to distinguish documents when only one sentence differs. In the future, it will be interesting to see how the model performs when it is trained on the insertion task directly.

Results on Summary Coherence Rating
Table 4 presents the results on the summary coherence rating task, where we compare our models with the reported results of the graph-based method (Guinaudeau and Strube, 2013) and the initial entity grid model (Barzilay and Lapata, 2008) on the same experimental setting. The extended grid model does not use pairwise training, and therefore could not be trained on the summarization dataset. Since there are not many training instances, our neural models may not learn well for this task. Therefore, we also present versions of our model where we use pre-trained models from the discrimination task on the WSJ corpus (last two rows in the table). The pre-trained models are then fine-tuned on the summary rating task.
We can observe that even without pre-training our models outperform existing models, and pre-training gives further improvements. Specifically, Pre-trained Grid-CNN gives an improvement of 2.5% over the Grid model, and including entity features pushes the improvement further to 3.7%.

Related Work
Barzilay and Lapata (2005, 2008) introduced the entity grid representation of discourse to model local coherence that captures the distribution of discourse entities across sentences in a text. They also introduced three tasks to evaluate the performance of coherence models: discrimination, summary coherence rating, and readability.
A number of extensions of the basic entity grid model have been proposed. Elsner and Charniak (2011) included entity-specific features to distinguish between entities. Feng and Hirst (2012) used the basic grid representation, but improved its learning-to-rank scheme. Their model learns not only from the original document and its permutations but also from ranking preferences among the permutations themselves. Guinaudeau and Strube (2013) convert a standard entity grid into a bipartite graph representing entity occurrences in sentences. To model local entity transition, the method constructs a directed projection graph representing the connection between adjacent sentences. Two sentences are connected by an edge if they share at least one entity. The coherence score of the document is then computed as the average out-degree of sentence nodes.
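The graph-based score summarized above can be illustrated with a small sketch. It follows the wording of this paragraph only (unweighted edges between adjacent sentences that share an entity) and is not the reference implementation from the toolkit used in the experiments; the function name and the toy grids are hypothetical.

```python
def average_out_degree(grid):
    """Average out-degree of sentence nodes in an unweighted projection graph.

    grid: rows are sentences, entries are roles in {S, O, X, -}.
    An edge points from sentence i to sentence i + 1 when they share an entity.
    """
    m = len(grid)
    if m == 0:
        return 0.0
    # Set of entity indices mentioned in each sentence
    present = [{j for j, role in enumerate(row) if role != "-"} for row in grid]
    edges = sum(1 for i in range(m - 1) if present[i] & present[i + 1])
    return edges / m

linked = [["S", "-"],
          ["O", "X"],
          ["-", "X"]]          # every adjacent pair shares an entity
unlinked = [["S", "-"],
            ["-", "X"],
            ["S", "-"]]        # no adjacent pair shares an entity
score_linked = average_out_degree(linked)
score_unlinked = average_out_degree(unlinked)
```

No training is involved: the score depends only on which entities recur, which is consistent with the method being unsupervised.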
In addition, there are some approaches that model text coherence based on coreferences and discourse relations. Elsner and Charniak (2008) proposed the discourse-new model by taking into account mentions of all referring expressions (i.e., NPs), whether they are first (discourse-new) or subsequent (discourse-old) mentions. Given a document, they run a maximum-entropy classifier to assign each NP a label $L_{np} \in \{new, old\}$. The coherence score of the document is then estimated by $\prod_{np:NPs} P(L_{np} \mid np)$. In this work, they also estimate text coherence through pronoun coreference modeling. Lin et al. (2011) assume that a coherent text has certain discourse relation patterns. Instead of modeling entity transitions, they model discourse role transitions between sentences. In a follow-up work, Feng et al. (2014) trained the same model but using features derived from deep discourse structures annotated with Rhetorical Structure Theory or RST (Mann and Thompson, 1988b) relations. Louis and Nenkova (2012) introduced a coherence model based on syntactic patterns in text by assuming that sentences in a coherent discourse should share the same structural syntactic patterns.
In recent years, there has been a growing interest in neuralizing traditional NLP approaches: language modeling (Bengio et al., 2003), sequence tagging (Collobert et al., 2011), syntactic parsing (Socher et al., 2013), discourse parsing, etc. Following this tradition, in this paper we propose to neuralize the popular entity grid models. The distributed sentence model compared in our experiments is also a neural framework: it computes the coherence score of a document by estimating a coherence probability for every window of $L$ sentences (in the reported experiments, $L = 3$). First, it uses a recurrent or a recursive neural network to compute the representation of each sentence in the window from its words and their pre-trained embeddings. Then the concatenated vector is passed through a non-linear hidden layer, and finally the output layer decides if the window of sentences is a coherent text or not. Our approach is fundamentally different; our model operates over entity grids, and we use a convolutional architecture to model sufficiently long entity transitions.

Conclusion and Future Work
We presented a local coherence model based on a convolutional neural network that operates over the distributed representation of entity transitions in the grid representation of a text. Our architecture can model sufficiently long entity transitions, and can incorporate entity-specific features without losing generalization power. We described a pairwise ranking approach to train the model on a target task and learn task-specific features. Our evaluation on discrimination, insertion, and summary coherence rating tasks demonstrates the effectiveness of our approach, yielding the best results reported so far on these tasks.
In the future, we would like to include other sources of information in our model. Our initial plan is to include rhetorical relations, which have been shown to benefit existing grid models (Feng et al., 2014). We would also like to extend our model to other forms of discourse, especially asynchronous conversations, where participants communicate with each other at different times (e.g., forum, email).