A Unified Neural Coherence Model

Recently, neural approaches to coherence modeling have achieved state-of-the-art results in several evaluation tasks. However, we show that most of these models often fail on harder tasks with more realistic application scenarios. In particular, the existing models underperform on tasks that require the model to be sensitive to local contexts such as candidate ranking in conversational dialogue and in machine translation. In this paper, we propose a unified coherence model that incorporates sentence grammar, inter-sentence coherence relations, and global coherence patterns into a common neural framework. With extensive experiments on local and global discrimination tasks, we demonstrate that our proposed model outperforms existing models by a good margin, and establish a new state-of-the-art.


Introduction
Coherence modeling involves building text analysis models that can distinguish a coherent text from incoherent ones. It has been a key problem in discourse analysis with applications in text generation, summarization, and coherence scoring.
Various linguistic theories have been proposed to formulate coherence, some of which have inspired the development of many existing coherence models. These include the entity-based local models (Barzilay and Lapata, 2008; Elsner and Charniak, 2011b) that consider the syntactic realization of entities in adjacent sentences, inspired by Centering Theory (Grosz et al., 1995). Another line of research uses discourse relations between sentences to predict local coherence (Pitler and Nenkova, 2008; Lin et al., 2011). These methods are inspired by discourse structure theories like Rhetorical Structure Theory (RST) (Mann and Thompson, 1988), which formalizes coherence in terms of discourse relations. Other notable methods include word co-occurrence based local models (Soricut and Marcu, 2006), content (or topic distribution) based global models (Barzilay and Lee, 2004), and syntax based local and global models (Louis and Nenkova, 2012).
With the neural invasion, some of the above traditional models have received neural versions with much improved performance. For example, Li and Hovy (2014) implicitly model syntax and inter-sentence relations using a neural framework that uses a recurrent (or recursive) layer to encode each sentence and a fully-connected layer with sigmoid activations to estimate coherence probability for every window of three sentences. Li and Jurafsky (2017) incorporate global topic information with an encoder-decoder architecture, which is also capable of generating discourse. Mesgar and Strube (2018) model change patterns of salient semantic information between sentences. Nguyen and Joty (2017) and Mohiuddin et al. (2018) propose neural entity grid models using convolutions over distributed representations of entity transitions, and report state-of-the-art results in standard evaluation tasks on the Wall Street Journal (WSJ) corpus.
Traditionally, coherence models have been evaluated on two kinds of tasks. The first kind includes synthetic tasks such as discrimination and insertion that evaluate the models directly based on their ability to identify the right order of the sentences in a text with different levels of difficulty (Barzilay and Lapata, 2008; Elsner and Charniak, 2011b). The second kind evaluates the impact of the coherence score as an additional feature in downstream applications like readability assessment and essay scoring (Barzilay and Lapata, 2008; Mesgar and Strube, 2018).
Although coherence modeling has come a long way in terms of novel models and innovative evaluation tasks (Elsner and Charniak, 2011a; Mohiuddin et al., 2018), it is far from being solved. As we show later, state-of-the-art models often fail on harder tasks like local discrimination and insertion that ask the model to evaluate a local context (e.g., a 3-sentence window). These tasks have direct applications in utterance ranking (Lowe et al., 2015) or bot detection in dialogue, and in sentence ordering for summarization.
According to Grosz and Sidner (1986), three factors collectively contribute to discourse coherence: (a) the organization of discourse segments, (b) the intention or purpose of the discourse, and (c) attention or focused items. The entity-based approaches capture attentional structure, the syntax-based approaches consider intention, and the organizational structure is largely captured by models that consider discourse relations and content (topic) distribution. Although methods such as those of Elsner et al. (2007) and Li and Jurafsky (2017) attempt to combine these aspects of coherence, to our knowledge, no method considers all three aspects together in a single framework.
In this paper, we propose a unified neural model that incorporates sentence grammar (intentional structure), discourse relations, attention and topic structures in a single framework. We use an LSTM sentence encoder with explicit language model loss to capture the syntax. Inter-sentence discourse relations are modeled with a bilinear layer, and a lightweight convolution-pooling is used to capture the attention and topic structures. We evaluate our models on both local and global discrimination tasks on the benchmark dataset. Our results show that our approach outperforms existing methods by a wide margin in both tasks. We have released our code at https://ntunlpsg.github.io/project/coherence/ncoh-emnlp19/ for research purposes.

Related Works
Inspired by various linguistic theories of discourse, many coherence models have been proposed. In this section, we give a brief overview of the existing coherence models.
Motivated by Centering Theory (Grosz et al., 1995), Barzilay and Lapata (2005, 2008) proposed the entity-based local model for representing and assessing text coherence, which showed significant improvements in two out of three evaluation tasks. Their model represents a text by a two-dimensional array called an entity grid that captures local transitions of discourse entities across sentences as the deciding patterns for assessing coherence. They consider the salience of the entities, measured by their occurrence frequency, to distinguish transitions of important entities from those of unimportant ones.
Subsequent studies extended the basic entity grid model. By including non-head nouns as entities in the grid, Elsner and Charniak (2011b) gained significant improvements. They incorporate entity-specific features like named entity, noun class, and modifiers to distinguish between entities of different types, which led to further improvements. Instead of using the transitions of grammatical roles, Lin et al. (2011) model the transitions of discourse roles for entities. Feng and Hirst (2012) used the basic entity grid, but improved its learning-to-rank scheme. Their model learns not only from the original document and its permutations but also from ranking preferences among the permutations themselves. Guinaudeau and Strube (2013) proposed a graph-based unsupervised method for modeling text coherence. Assuming that the sentences in a coherent discourse should share the same structural syntactic patterns, Louis and Nenkova (2012) introduced a coherence model based on syntactic patterns in text. Their proposed method comprises a local and a global coherence model, where the former captures the co-occurrence of structural features in adjacent sentences and the latter captures the global structure based on clusters of sentences with similar syntax.
Our model also considers syntactic patterns through a bi-LSTM sentence encoder that is trained on an explicit language modeling loss. Compared to the entity grid and the syntax-based models, our model does not require any syntactic parser.
With the neural tsunami, some of the above traditional models have received neural versions with better performance. Li and Hovy (2014) proposed a neural framework to compute the coherence score of a document by estimating coherence probability for every window of three sentences. Li and Jurafsky (2017) proposed two encoder-decoder models, where the first model incorporates global discourse information (e.g., topics) by feeding the output of a sentence-level HMM-LDA model (Gruber et al., 2007) and the second model is trained end-to-end with variational inference. Our proposed model also models inter-sentence relations and global coherence patterns. We use a bilinear layer to model relations between two consecutive sentences exclusively. Also, our global model implements a lightweight convolution that requires far fewer parameters, which gives better generalization. Moreover, we train the whole network end-to-end with a window-based adaptive pairwise ranking loss.
Nguyen and Joty (2017) proposed a neural version of the entity grid model where they first transform the grammatical roles in a grid into their distributed representations. Then they employ a convolution operation over it to model entity transitions in the distributed space. Finally, they compute the coherence score from the convolved features by a spatial max-pooling operation. The model is trained with a document-level (global) pairwise ranking loss. Mohiuddin et al. (2018) improve the neural entity grid model by lexicalizing its entity transitions. They use off-the-shelf word embeddings to achieve better generalization with the lexicalized model. As we will demonstrate, because of the spatial-pooling operation, entity-based neural models are not sensitive to mismatches of local patterns in a document, limiting their applicability to tasks that require local discrimination. Another crucial limitation of employing a document-level pairwise ranking loss is that the loss from the document-level permutation for a negative document may penalize the convolution kernel weights even if the local permutation matches that of the positive document. In contrast, we apply a window-level (local) adaptive pairwise ranking loss that gets activated only if the corresponding windows of the positive and negative documents differ. This way our model is sensitive to local patterns without penalizing the weights unfairly. We capture global patterns using a separate lightweight convolution module.

Proposed Model
Let $D = (s_1, \cdots, s_n)$ be a document consisting of $n$ sentences. Our goal is to assess its coherence score. Figure 1 provides an overview of our proposed unified coherence model. It has four components in a Siamese architecture (Bromley et al., 1993): (i) a sentence encoder (§3.1), (ii) a local coherence model (§3.2), (iii) a global coherence model (§3.3), and (iv) a coherence scoring layer (§3.4). For encoding a sentence, we first map each word of the sentence to its corresponding vector representation. We then use a bidirectional LSTM sentence encoder with explicit language model loss to capture the sentence grammar. Given the sentence representations, the local and global coherence models extract the respective coherence features. The local coherence model implements a bilinear layer to model inter-sentence discourse relations. This layer captures the local contexts of the document. To capture the attention (entity distribution) and topic structures, i.e., the global coherence of the document, our global coherence model uses a lightweight convolution (Wu et al., 2019) with average pooling. The coherence scoring layer is a linear layer that evaluates the coherence from the extracted features. The whole architecture is trained end-to-end with a pairwise ranking loss. In the following, we elaborate on the different components of our proposed model.

Modeling Intention
A discourse has a purpose such as describing an event, explaining some results, evaluating a product, etc. As such, sentences in the discourse should support the purpose as a whole. The syntactic structure of the sentence can be used to model the intent structure (Louis and Nenkova, 2012). We use a bidirectional long short-term memory or bi-LSTM (Hochreiter and Schmidhuber, 1997) to encode each sentence into a vector representation while modeling its compositional structure.
For an input sentence $s_i = (w_1, \cdots, w_m)$ of length $m$, we first map each word $w_t$ to its corresponding vector representation $e_t \in \mathbb{R}^d$, where $d$ is the dimension of the word embedding. The LSTM recurrent layer then computes a compositional representation $h_t \in \mathbb{R}^p$ at every time step $t$ by performing nonlinear transformations of the current time step word vector representation $e_t$ and the output of the previous time step $h_{t-1}$, where $p$ is the number of features in the LSTM hidden state. The output of the last time step $h_m$ is considered as the representation of the sentence. A bi-LSTM processes a given sentence $s_i$ in two directions, from left-to-right and right-to-left, yielding a representation $h_i = [\overrightarrow{h}; \overleftarrow{h}] \in \mathbb{R}^{2p}$, where $\overrightarrow{h}$ and $\overleftarrow{h}$ are the final outputs of the two directions and ';' denotes concatenation.
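We train our sentence encoder with an explicit language model loss. A bidirectional language model combines a forward and a backward language model (LM). Similar to Peters et al. (2018), we jointly minimize the negative log-likelihood of the forward and backward directions:
\[ L_{lm} = -\sum_{t=1}^{m} \Big( \log p(w_t \mid w_1, \cdots, w_{t-1}; \theta, \overrightarrow{\theta}_{lstm}) + \log p(w_t \mid w_{t+1}, \cdots, w_m; \theta, \overleftarrow{\theta}_{lstm}) \Big) \]
where $\overrightarrow{\theta}_{lstm}$ and $\overleftarrow{\theta}_{lstm}$ are the parameters of the forward and backward LSTMs, and $\theta$ denotes the rest of the parameters, which are shared.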
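As a rough illustration of this joint objective, the following Python sketch sums the negative log-likelihoods of a forward and a backward language model over one sentence. The per-token probabilities are hypothetical inputs standing in for the softmax outputs of the two LSTM directions, and the function name is illustrative rather than taken from the released code.

```python
import math

def bidirectional_nll(forward_probs, backward_probs):
    """Joint negative log-likelihood of a forward and a backward LM.

    forward_probs[t]  : probability the forward LM gives token t given its left context
    backward_probs[t] : probability the backward LM gives token t given its right context
    Both lists are hypothetical model outputs, one value per token.
    """
    if len(forward_probs) != len(backward_probs):
        raise ValueError("both directions must score the same tokens")
    return -sum(math.log(pf) + math.log(pb)
                for pf, pb in zip(forward_probs, backward_probs))

# Toy sentence of three tokens with made-up probabilities.
loss = bidirectional_nll([0.5, 0.25, 0.5], [0.25, 0.5, 0.5])
```

The loss shrinks toward zero as both directions assign higher probability to the observed tokens.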

Modeling Inter-Sentence Relation
Discourse relations between sentences reflect the organizational structure of a discourse that can be used to evaluate the coherence of a text (Lin et al., 2011; Li and Hovy, 2014). To model inter-sentence discourse relations, we use a bilinear model. Our bi-LSTM sentence encoder gives a representation $h_i \in \mathbb{R}^{2p}$ for each sentence $s_i$ in the document. We feed the representations of every two consecutive sentences $(h_i, h_{i+1})$ to this layer, which applies a bilinear transformation as:
\[ v_i = h_i^{\top} W_b \, h_{i+1} + b \]
where $W_b \in \mathbb{R}^{q \times 2p \times 2p}$ is a learnable tensor, and $b \in \mathbb{R}^q$ is a learnable bias vector. Here $q$ is the number of output features (i.e., $v_i \in \mathbb{R}^q$).
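The bilinear layer can be sketched in plain Python as follows, assuming the usual form in which output feature k is h_i^T W_b[k] h_{i+1} + b[k]. The toy sizes (two input features, two output features) and all weight values are made up for illustration.

```python
def bilinear(h_a, h_b, W, b):
    """Return v with v[k] = h_a^T W[k] h_b + b[k] for each output feature k."""
    out = []
    for W_k, b_k in zip(W, b):
        total = b_k
        for r, row in enumerate(W_k):
            for c, weight in enumerate(row):
                total += h_a[r] * weight * h_b[c]
        out.append(total)
    return out

# Assumed toy sizes: 2p = 2 input features, q = 2 output features.
W = [
    [[1.0, 0.0], [0.0, 1.0]],  # identity slice: reduces to a dot product
    [[0.0, 1.0], [0.0, 0.0]],  # picks out h_a[0] * h_b[1]
]
b = [0.0, 0.5]

sentence_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 1.0]]  # h_1, h_2, h_3
local_features = [bilinear(sentence_vectors[i], sentence_vectors[i + 1], W, b)
                  for i in range(len(sentence_vectors) - 1)]
```

Each slice of the tensor scores one kind of pairwise interaction between adjacent sentences, so a document of n sentences yields n - 1 local feature vectors before padding.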

Modeling Global Coherence Patterns
The model proposed so far captures only local information. However, global discourse phenomena like entity or topic distributions are also important for coherence evaluation (Barzilay and Lee, 2004; Elsner et al., 2007; Louis and Nenkova, 2012).
Global coherence is modeled in our architecture by a convolution-pooling mechanism.
As shown in Figure 1, our global coherence sub-module takes the representations $H = (h_1, \cdots, h_n)$ of all the sentences in a document generated by the bi-LSTM encoder. The module uses six convolution layers with residual connections, followed by an average pooling layer. Instead of using regular convolutions, we use lightweight convolution (Wu et al., 2019), which is built upon depth-wise convolution (Chollet, 2016). Depth-wise convolutions perform a convolution independently over every input channel, which significantly reduces the number of parameters, as shown in Figure 2. For a given input $H \in \mathbb{R}^{n \times d}$, the output $O \in \mathbb{R}^{n \times d}$ of the depth-wise convolution with convolution weight $W \in \mathbb{R}^{d \times k}$ and kernel size $k$, for element $i$ and output dimension $c$, can be written as:
\[ O_{i,c} = \mathrm{DepthwiseConv}(H, W_{c,:}, i, c) = \sum_{j=1}^{k} W_{c,j} \cdot H_{(i + j - \lceil (k+1)/2 \rceil),\, c} \]
The convolutions are done over the input dimensions.
Compared to regular convolutions, depth-wise convolutions reduce the number of parameters from $d^2 k$ to $dk$ (note that $d = 2p$ in our case).
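To make the size of this reduction concrete, the short Python sketch below counts kernel weights for both variants. The hidden size p = 128 and kernel size k = 5 are assumed values chosen only for illustration, and bias terms are ignored.

```python
def regular_conv_params(d, k):
    """A regular 1-D convolution mixes all d input channels into each of d outputs."""
    return d * d * k

def depthwise_conv_params(d, k):
    """A depth-wise convolution keeps one length-k kernel per channel."""
    return d * k

p = 128      # assumed LSTM hidden size per direction
d = 2 * p    # channels seen by the convolution (bi-LSTM output size)
k = 5        # assumed kernel size

regular = regular_conv_params(d, k)
depthwise = depthwise_conv_params(d, k)
reduction_factor = regular // depthwise  # equals d
```

Under these assumptions the depth-wise variant needs d times fewer kernel weights than the regular one.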
Lightweight convolutions make the depth-wise convolution even simpler by sharing groups of output channels and normalizing weights across the temporal dimension using a softmax. A lightweight convolution has a fixed context window and determines the importance of context elements with a set of weights that do not change over time steps. For the $i$-th element in the sequence and output channel $c$, lightweight convolution computes:
\[ \mathrm{LightConv}(H, W_{g,:}, i, c) = \mathrm{DepthwiseConv}(H, \mathrm{softmax}(W_{g,:}), i, c) \]
where $g = \lceil cG/d \rceil$ with $G$ being the number of groups. The number of parameters with lightweight convolutions reduces to $G \cdot k$. Wu et al. (2019) show that models equipped with lightweight convolution exhibit better generalization compared to regular convolutions. This is crucial in our case: we use convolutions to model a whole document, and with a large kernel size it would be difficult to learn from datasets that are small compared to (sentence-level) machine translation datasets.
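A minimal pure-Python sketch of a lightweight convolution over a document is given below. It assumes an odd kernel size, zero padding at the document boundaries, and contiguous channel groups with zero-based indexing; these choices and the toy inputs are illustrative assumptions, not details confirmed by the text.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def lightweight_conv(H, W, num_groups):
    """Depth-wise convolution with softmax-normalized, group-shared kernels.

    H : n x d input (one row per sentence, one column per channel)
    W : num_groups x k raw kernel weights, k assumed odd
    Positions outside the document contribute zero (assumed zero padding).
    """
    n, d = len(H), len(H[0])
    k = len(W[0])
    kernels = [softmax(row) for row in W]   # normalize over the temporal axis
    half = k // 2
    out = [[0.0] * d for _ in range(n)]
    for i in range(n):
        for c in range(d):
            g = c * num_groups // d         # zero-based group of channel c
            for j in range(k):
                src = i + j - half
                if 0 <= src < n:
                    out[i][c] += kernels[g][j] * H[src][c]
    return out

H = [[3.0, 1.0], [3.0, 2.0], [3.0, 3.0]]    # 3 sentences, 2 channels
W = [[0.0, 0.0, 0.0]]                        # one group, uniform after softmax
O = lightweight_conv(H, W, num_groups=1)
num_params = len(W) * len(W[0])              # groups times kernel size
```

With all-zero raw weights the softmax yields a uniform kernel, so each output is a moving average over the neighbouring sentences that exist inside the document.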
The lightweight convolution layers generate $d$ feature maps $f_i \in \mathbb{R}^n$, $i = 1, \ldots, d$, for each input document by the convolutional operation over an input dimension across all the sentences in a document. Subsequently, global average pooling is performed over the extracted feature maps to achieve a global view of the input document. The resulting global feature $u \in \mathbb{R}^d$ can be expressed as follows:
\[ u = \frac{1}{n} [f_1, \cdots, f_d]^{\top} \mathbf{1} \]
where $\mathbf{1} \in \mathbb{R}^n$ is the vector of ones and $n$ is the number of sentences in an input document. The global document-level features $u$ are then concatenated with the local features of each consecutive sentence pair $(v_i; v_{i+1})$ in the document, i.e., the output of the bilinear layer (see Figure 1).
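The pooling and concatenation steps can be sketched as follows. The concatenation order (v_i, then v_{i+1}, then u) and the toy values are assumptions made for illustration.

```python
def global_average_pool(conv_output):
    """Average an n x d convolution output over the n sentences, giving u in R^d."""
    n = len(conv_output)
    d = len(conv_output[0])
    return [sum(row[c] for row in conv_output) / n for c in range(d)]

def window_features(local_feats, u):
    """Build z_i by concatenating v_i, v_{i+1} and the global feature u (assumed order)."""
    return [local_feats[i] + local_feats[i + 1] + u
            for i in range(len(local_feats) - 1)]

conv_output = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # n = 3 sentences, d = 2 channels
u = global_average_pool(conv_output)

local_feats = [[1.0], [2.0], [3.0]]                  # toy bilinear outputs with q = 1
Z = window_features(local_feats, u)
```

Because u is an average over all sentences, the same global vector is appended to every window of the document.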

Coherence Scoring
We then feed the concatenated global and local features $z_i$ to the final linear layer of our model to compute the coherence score $y_i \in \mathbb{R}$ for each local window:
\[ y_i = w_l^{\top} z_i + b_l \]
where $w_l$ is a weight vector and $b_l$ is a bias. The final decision on a pair of documents is made by summing up all local scores of each document and comparing the summed scores.
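The scoring and decision rule can be sketched in a few lines of Python, assuming each window score is a linear function of its feature vector. The weights and window features below are made-up toy values.

```python
def window_scores(Z, w, b):
    """Linear score w . z + b for every window feature vector z in Z."""
    return [sum(wj * zj for wj, zj in zip(w, z)) + b for z in Z]

def document_score(Z, w, b):
    """Document-level score: the sum of all local window scores."""
    return sum(window_scores(Z, w, b))

def more_coherent(Z_a, Z_b, w, b):
    """True if document A receives a higher summed score than document B."""
    return document_score(Z_a, w, b) > document_score(Z_b, w, b)

w, b = [1.0, -1.0], 0.5            # toy parameters
Z_pos = [[2.0, 1.0], [3.0, 1.0]]   # toy window features of one document
Z_neg = [[1.0, 2.0], [1.0, 3.0]]   # toy window features of another document
```

Since the same parameters score both documents, either document can also be scored on its own.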

Overall Objective and Training Details
Our model assigns a coherence score $y_i$ to every possible local window $D_i$ in the document $D$, where $i$ is the local window index. During implementation, the input document is padded, so that the number of possible local windows is the same as the number of sentences ($n$) in the document $D$.
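One way to obtain exactly n windows from n sentences is sketched below. It assumes 3-sentence windows and padding split across both ends of the document; the padding token and the split are hypothetical choices, since the text does not specify them.

```python
PAD = "<pad>"  # hypothetical padding sentence

def local_windows(sentences, size=3):
    """Return one window per sentence by padding both ends of the document."""
    left = (size - 1) // 2
    right = size - 1 - left
    padded = [PAD] * left + list(sentences) + [PAD] * right
    return [tuple(padded[i:i + size]) for i in range(len(sentences))]

windows = local_windows(["s1", "s2", "s3", "s4"])
```

With this layout, window i is centred on sentence i, so every sentence has a corresponding local score.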
Let $y = \Omega(D \mid \Theta)$ define our model that produces the coherence scores $y = (y_1, \ldots, y_n)$ for an input document $D$, with $\Theta$ being the parameters. We use a window-level pairwise ranking approach (Collobert et al., 2011) to learn $\Theta$.
Our training set contains ordered pairs of documents $(D_{pos}, D_{neg})$, where document $D_{pos}$ exhibits a higher degree of coherence than document $D_{neg}$. See Section 4 for details about the dataset. We seek to learn $\Theta$ that assigns higher coherence scores to $D_{pos}$ than to $D_{neg}$. We observed that the naive pairwise ranking loss that uses a fixed margin unfairly penalizes the locally positive sentences during training. In other words, the loss should be active only for local windows that differ in $D_{pos}$ and $D_{neg}$. To address this, we propose to use an adaptive pairwise ranking loss $L_\Theta$ defined as follows.
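\[ L_\Theta = \max\{0,\ \phi(D_{pos}, D_{neg}) - \Omega(D_{pos} \mid \Theta) + \Omega(D_{neg} \mid \Theta)\} \]
where the loss is evaluated on each pair of corresponding local windows of $D_{pos}$ and $D_{neg}$, and $\phi(D_{pos}, D_{neg})$ is an adaptive margin given by
\[ \phi(D_{pos}, D_{neg}) = \begin{cases} 0 & \text{if the corresponding windows are identical} \\ \tau & \text{otherwise} \end{cases} \]
where $\tau$ is a margin constant. Our total loss is $L = L_\Theta + L_{lm}$. Note that our model shares all the layers and components, i.e., $\Theta$, to obtain $\Omega(D_{pos} \mid \Theta)$ and $\Omega(D_{neg} \mid \Theta)$ from a pair of documents $(D_{pos}, D_{neg})$. Therefore, once trained, it can be used to score any input document independently.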
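The effect of the adaptive margin can be illustrated with a small Python sketch. It assumes a hinge-style loss per window whose margin is tau when the two windows differ and zero when they are identical, summed over windows; the summation, the function name, and all numbers are illustrative assumptions.

```python
def pairwise_ranking_loss(pos_scores, neg_scores, pos_windows, neg_windows,
                          tau=1.0, adaptive=True):
    """Window-level hinge loss with a fixed or an adaptive margin.

    With adaptive=True the margin is tau only where the two windows differ,
    so identical windows are not pushed apart.
    """
    total = 0.0
    for y_pos, y_neg, w_pos, w_neg in zip(pos_scores, neg_scores,
                                          pos_windows, neg_windows):
        margin = tau if (not adaptive or w_pos != w_neg) else 0.0
        total += max(0.0, margin - y_pos + y_neg)
    return total

pos_windows = [("a", "b", "c"), ("d", "e", "f")]
neg_windows = [("a", "b", "c"), ("e", "d", "f")]   # only the second window is permuted
pos_scores = [0.2, 0.9]
neg_scores = [0.2, 0.4]

adaptive_loss = pairwise_ranking_loss(pos_scores, neg_scores, pos_windows, neg_windows)
fixed_loss = pairwise_ranking_loss(pos_scores, neg_scores, pos_windows, neg_windows,
                                   adaptive=False)
```

In this toy case the fixed margin adds a penalty for the first window even though it is identical in both documents, while the adaptive margin charges only the permuted window.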

Evaluation Tasks and Datasets
For comparison purposes with previous work, we evaluate our models on the standard "global" discrimination task (Barzilay and Lapata, 2008), where a document is compared to a random permutation of its sentences, which is considered to be incoherent. We also evaluate on an inverse discrimination task (Mohiuddin et al., 2018), where the sentences of the original document are placed in the reverse order to create the incoherent document. Similar to them, we do not train our models explicitly on this task; rather, we use the trained model from the standard discrimination task. In addition, and more importantly, we evaluate the models on a more challenging "local" discrimination task, where two documents differ only in a local context (e.g., a 3-sentence window), as shown with an example in Figure 3.
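The negative documents for the two global tasks can be generated along the following lines. This is a sketch under stated assumptions: duplicates among the sampled permutations are skipped and sampling stops after a bounded number of tries, neither of which is specified in the text.

```python
import random

def random_permutations(sentences, num, rng=None, max_tries=1000):
    """Sample up to `num` distinct sentence orders that differ from the original."""
    rng = rng or random.Random()
    original = list(sentences)
    identity = tuple(range(len(original)))
    seen, out = set(), []
    for _ in range(max_tries):
        if len(out) == num:
            break
        order = list(identity)
        rng.shuffle(order)
        key = tuple(order)
        if key == identity or key in seen:
            continue            # skip the original order and repeats
        seen.add(key)
        out.append([original[i] for i in order])
    return out

def inverse_order(sentences):
    """Negative document for the inverse discrimination task."""
    return list(reversed(sentences))

doc = ["s1", "s2", "s3", "s4", "s5"]
negatives = random_permutations(doc, num=20, rng=random.Random(0))
reversed_doc = inverse_order(doc)
```

Very short documents have fewer distinct orders than requested, in which case the function simply returns as many as it finds.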
Dataset for Global Discrimination. We follow the same experimental setting of the WSJ news dataset as used in previous works (Mohiuddin et al., 2018; Nguyen and Joty, 2017; Elsner and Charniak, 2011b; Feng et al., 2014). Similar to them, we use 20 random permutations of each document for both training and testing, and exclude permutations that match the original one.
We first set $w$ as the number of local windows that we want to permute in a document. Based on this, we create four datasets for our local discrimination task: $D_{w=1}$, $D_{w=2}$, $D_{w=3}$ and $D_{w=1,2,3}$. $D_{w=1}$ contains the documents where only one randomly selected window is permuted. Similarly, $D_{w=2}$ contains the documents where two randomly selected windows are permuted. $D_{w=3}$ is similarly created for 3 windows. $D_{w=1,2,3}$ denotes the concatenation of these datasets. The number of negative documents for each article was restricted not to exceed 20 samples. Additionally, we exclude cases where windows overlap. In other words, the sentences are allowed to be permuted only inside their respective window.
We randomly select 10% of the training set for development purposes. Table 2 summarizes the datasets. The training and test sets for $D_{w=1,2,3}$ consist of 32,610 and 26,410 pairs, respectively.
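A locally permuted negative document can be produced along the following lines. The sketch assumes 3-sentence windows placed on a fixed non-overlapping grid, which is one simple way to guarantee that windows never overlap; the actual window selection procedure may differ.

```python
import random

def permute_local_windows(sentences, w, size=3, rng=None):
    """Shuffle sentences inside w randomly chosen non-overlapping windows."""
    rng = rng or random.Random()
    # Candidate windows sit on a fixed grid so that they cannot overlap (assumption).
    starts = list(range(0, len(sentences) - size + 1, size))
    if w > len(starts):
        raise ValueError("document too short for w non-overlapping windows")
    identity = list(range(size))
    result = list(sentences)
    for start in rng.sample(starts, w):
        order = identity[:]
        while order == identity:      # make sure the window really changes
            rng.shuffle(order)
        result[start:start + size] = [sentences[start + j] for j in order]
    return result

doc = [f"s{i}" for i in range(9)]
negative = permute_local_windows(doc, w=2, rng=random.Random(0))
```

Sentences outside the selected windows keep their positions, so the positive and negative documents differ only locally.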

Experiments
This section presents details of our experiment procedures and results.

Models Compared
We compare our proposed unified coherence model with several existing models. Baselines that are not publicly available were reimplemented; for the others, we ran the publicly available code, and the rest of the reported results are taken from the original papers. In the following, we present brief descriptions of the existing models.
(a) Positive document
"The House voted to boost the federal minimum wage for the first time since early 1981 , casting a solid 382-37 vote for a compromise measure backed by President Bush."
"The vote came after a debate replete with complaints from both proponents and critics of a substantial increase in the wage floor."
"Advocates said the 90-cent-an-hour rise , to $ 4.25 an hour by April 1991 , is too small for the working poor , while opponents argued that the increase will still hurt small business and cost many thousands of jobs."
"But the legislation reflected a compromise agreed to on Tuesday by President Bush and Democratic leaders in Congress , after congressional Republicans urged the White House to bend a bit from its previous resistance to compromise."
"So both sides accepted the compromise , which would lead to the first lifting of the minimum wage since a four-year law was enacted in 1977 , raising the wage to $ 3.35 an hour from $ 2.65."
(b) Negative document
"Advocates said the 90-cent-an-hour rise , to $ 4.25 an hour by April 1991 , is too small for the working poor , while opponents argued that the increase will still hurt small business and cost many thousands of jobs."
"The vote came after a debate replete with complaints from both proponents and critics of a substantial increase in the wage floor."
"The House voted to boost the federal minimum wage for the first time since early 1981 , casting a solid 382-37 vote for a compromise measure backed by President Bush."
"But the legislation reflected a compromise agreed to on Tuesday by President Bush and Democratic leaders in Congress , after congressional Republicans urged the White House to bend a bit from its previous resistance to compromise."
"So both sides accepted the compromise , which would lead to the first lifting of the minimum wage since a four-year law was enacted in 1977 , raising the wage to $ 3.
Distributed Sentence Model (L&H). This is the neural model proposed by Li and Hovy (2014). Similar to our local model, it extracts local coherence features for small windows of sentences to compute the coherence score of a document. First, they use a recurrent or a recursive neural network to compute the representation for each sentence in the local window from its words and their pretrained embeddings. Then the concatenated vector is passed through a non-linear hidden layer, and finally the output layer decides if the window of sentences is a coherent text or not. The main differences between our implementation and the implementation described in their paper are that we used a bi-LSTM (as opposed to a simple RNN) for sentence encoding and trained the network with the Adam optimizer (as opposed to AdaGrad).
Grid-all nouns & Extended grid (E&C; https://bitbucket.org/melsner/browncoherence/src). Elsner and Charniak (2011b) report significant gains by including all nouns as entities in the original entity grid model as opposed to considering only head nouns. In their extended grid model, they used 9 additional entity-specific features, 4 of which are computed from external corpora.
Neural Grid & Ext. Neural Grid (N&J). These are the neural versions of the entity grid models proposed by Nguyen and Joty (2017). They use convolutions over grammatical roles to model entity transitions in the distributed space. In the extended model, they incorporate three entity-specific features.
Lex. Neural Grid (M&J). Mohiuddin et al. (2018) improved the neural grid model by lexicalizing the entity transitions. Experiment results for this model were obtained with the optimal setting described in the original paper.
Global Coherence Model. This is the global coherence model component of our proposed unified model as described in Section 3.3. The model extracts document-level features through lightweight convolutions. The extracted features are subsequently averaged along the temporal dimension, and the result is used in a linear layer for coherence scoring. This model used a kernel size of 5, and each document was padded by a size of 3.

Settings of Our Model
We held out 10% of the training documents to form a development set (DEV) on which we tune the hyper-parameters of our models. In our experiments, we use both word2vec (Mikolov et al., 2013) and ELMo (Peters et al., 2018) for the distributed representations of the words. Unlike word2vec, ELMo is capable of capturing both subword information and contextual clues. We implemented our models in the PyTorch framework on a Linux machine with a single GTX 1080 Ti GPU.
During training, we use the Adam optimizer (Kingma and Ba, 2015) with $L_2$ regularization (regularization parameter 0.00001). We trained the models for up to 25 epochs so that their performance converges. To search for optimal parameters, we conducted experiments while varying the hyper-parameters. Specifically, we investigated minibatch sizes in {5, 10, 20, 25}, sentence embedding sizes in {128, 256}, lightweight convolution kernel sizes in {3, 5, 7, 9}, and bilinear output dimensions in {32, 64}. We present the optimal hyper-parameter values in the supplementary document. The results are reported by averaging over five different runs of the model with different seeds for statistical stability.
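For reference, the search space above can be enumerated as follows. The sketch assumes a full grid over the listed values, which the text does not state explicitly, and the dictionary keys and the accuracy values in the usage line are hypothetical.

```python
from itertools import product
from statistics import mean

search_space = {
    "batch_size": [5, 10, 20, 25],
    "sentence_embedding_size": [128, 256],
    "kernel_size": [3, 5, 7, 9],
    "bilinear_output_size": [32, 64],
}

# Every combination of the listed values (assumes an exhaustive grid).
configs = [dict(zip(search_space, values))
           for values in product(*search_space.values())]

def average_accuracy(run_accuracies):
    """Mean accuracy over runs with different random seeds."""
    return mean(run_accuracies)

example_average = average_accuracy([90.0, 91.0, 92.0, 93.0, 94.0])  # made-up values
```

A full grid over these values contains 64 configurations, each of which would be trained and then averaged over five seeds under this assumption.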

Results on Local Discrimination
Table 3 shows the results in accuracy on the "local" discrimination task. From the table, we see that existing models, including our global model, perform poorly compared to our proposed local models. They are likely to fail to distinguish the text segments that are locally coherent and to penalize them unfairly. One possible explanation of this phenomenon lies in the nature of the global models. These models (except L&H) are designed to make a decision at a global level, thus they are likely to penalize locally coherent segments of a text. This observation is further bolstered by the performance of our local coherence models, which show higher sensitivity in discriminating locally coherent texts and achieve significantly higher accuracy compared to the baseline models and our global model.
Another aspect to notice here is that the performance of all the models becomes gradually better with the increase in the number of permutation windows in the dataset. This is not surprising because in the datasets with a lower number of permutation windows, the difference between a positive and a negative document is very subtle. For example, in the $D_{w=1}$ dataset, positive and negative documents differ only in one small window. Another interesting observation regarding the entity-grid based neural models is that the model pretrained on the global discrimination task performs better than the ones trained on the specific tasks. From the table, we observe that our full model with ELMo word embeddings achieves the highest accuracies on the datasets $D_{w=1,2,3}$, $D_{w=2}$, and $D_{w=3}$, while on the $D_{w=1}$ dataset, our full model with the pretrained word2vec embeddings performs the best. The reason could be that with more generalized contextual embeddings, our model loses the discrimination capability for small changes in the document.

Results on Global Discrimination

From Table 4, we see that our unified neural coherence model outperforms the existing models by a good margin. On this dataset, our best model with the word2vec embeddings achieves 90.42% and 95.27% on the Standard and Inverse order discrimination tasks, respectively. We achieve the best results with our proposed model by using the ELMo word embeddings, where we get 93.19% and 96.78% accuracies on the Standard and Inverse order discrimination tasks, respectively.

Ablation Study
To investigate the impact of different components in our proposed model, we conducted two sets of ablation studies on the local and global discrimination tasks. Specifically, we want to see: (i) the impact of our global model component, and (ii) the impact of the language model (LM) loss.
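Local Discrimination. In the local discrimination task, we first compare the performance of the proposed model without the LM loss. As shown in the first block of Table 5, the addition of the global model to the local model degrades the performance on the $D_{w=1}$ dataset by 1.17% and 1.21% for word2vec and ELMo embeddings, respectively, while for the other datasets the addition of the global model improves performance. However, in the presence of the LM loss (second block in Table 5), the addition of the global model improves the performance across all the datasets on the local discrimination task.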
On the other hand, the addition of the LM loss to our model (with/without the global model) increases the accuracy for most of the datasets and embeddings. The exception is the ELMo embeddings on the $D_{w=1}$ dataset, where the overall performance drops by 1.60% and 0.23% for the local model with and without the global model, respectively.
Another interesting observation on the $D_{w=1}$ dataset is that in all the cases word2vec embeddings outperform ELMo. This unusual behavior of the $D_{w=1}$ dataset relative to the rest is not surprising because it is the hardest dataset, where the difference between the positive and the negative document is subtle. In this case, flexible and simple models generally outperform complex ones.
For the performance degradation of the global model in the $D_{w=1}$ case, we assume that in some texts, the global model fails to capture the significant feature from the locally negative region. Subsequently, the global feature is added into the score calculation at every local window, so the overall influence of the global model becomes bigger than that of the local model in the decision making.
Global Discrimination. We also studied the impact of our global model and the LM loss in the global discrimination task. As shown in Table 6, the addition of the global model and LM loss to the local model improves performance on the standard discrimination task by 1.34%.
However, the addition of the global model has a negative impact on the inverse order task and degrades accuracies by 2.42% and 1.28% in the presence and absence of the LM loss, respectively. We suspect that the global model is adding noise because of the pooling operation, which throws away the spatial relation between sentences and provides global information that is invariant to the sentence order. But in this task, order information is crucial. In the inverse order task, we get the best performance by adding the LM loss to our local model.

Conclusion
In this paper, we proposed a unified coherence model. The proposed model incorporates a local coherence model and a global coherence model to capture the sentence grammar (intentional structure), discourse relations, attention and topic structures in a single framework. The unified coherence model shows state-of-the-art results on the standard coherence assessment tasks: the inverse-order and the global discrimination tasks. Also, our evaluation on the local discrimination task demonstrates the effectiveness of the unified coherence model in assessing global and local coherence of texts.

Figure 1: An overview of the proposed coherence model (best viewed in color). The superscript ' ' above the output of each component denotes negative outputs, and the red shade represents incoherent portions in the document. Note that all network parameters and components are shared regardless of the input documents.

Figure 3: Sample data in the local permutation dataset. (a) is the positive document, WSJ0098, and (b) is the negative version of the positive document, in which the sentence order is permuted locally.

Table 1: Statistics of the WSJ news dataset used for the "global" discrimination task.

Table 2: Statistics of the WSJ news dataset used for the "local" discrimination task. $w$ denotes the number of permuted local windows in a document.

Table 3: Results in accuracy on the Local Discrimination task.

Table 4: Results in accuracy on the Global Discrimination task.
Table 4 presents the results in accuracy on the two "global" discrimination tasks: the Standard and the Inverse order discrimination. The reported results of the entity-grid models are from the original papers. 'Lex. Neural Grid (M&J) (code)' refers to the results achieved by running the code released by Mohiuddin et al. (2018) on our machine.

Table 5 ,
addition of the global model to the local model degrades the performance on the D w=1 dataset by 1.17% and 1.21% for word2vec and ELMo embeddings, respectively.While for the other datasets, we see improvements in performance for the addition of the global model.However, in the presence of the LM loss (second block in Table5), the addition of

Table 5: Ablation study of different model components on the Local Discrimination task.

Table 6: Ablation study of different model components on the Global Discrimination task.