Dynamic Entity Representations in Neural Language Models

Understanding a long document requires tracking how entities are introduced and evolve over time. We present a new type of language model, EntityNLM, that can explicitly model entities, dynamically update their representations, and contextually generate their mentions. Our model is generative and flexible; it can model an arbitrary number of entities in context while generating each entity mention at an arbitrary length. In addition, it can be used for several different tasks such as language modeling, coreference resolution, and entity prediction. Experimental results with all these tasks demonstrate that our model consistently outperforms strong baselines and prior work.


Introduction
Understanding a narrative requires keeping track of its participants over a long-term context. As a story unfolds, the information a reader associates with each character in a story increases, and expectations about what will happen next change accordingly. At present, models of natural language do not explicitly track entities; indeed, in today's language models, entities are no more than the words used to mention them.
In this paper, we endow a generative language model with the ability to build up a dynamic representation of each entity mentioned in the text. Our language model defines a probability distribution over the whole text, with a distinct generative story for entity mentions. It explicitly groups those mentions that corefer and associates with each entity a continuous representation that is updated by every contextualized mention of the entity, and that in turn affects the text that follows.
Figure 1: ENTITYNLM explicitly tracks entities in a text, including coreferring relationships between entities like [John]_1 and [He]_1. As a language model, it is designed to predict that a coreferent of [the coffee shop]_2 is likely to follow "told that," that the referring expression will be "it", and that "sold the best beans" is likely to come next, by using entity information encoded in the dynamic distributed representation.
Our method builds on recent advances in representation learning, creating local probability distributions from neural networks. It can be understood as a recurrent neural network language model, augmented with random variables for entity mentions that capture coreference, and with dynamic representations of entities. We estimate the model's parameters from data that is annotated with entity mentions and coreference.
Because our model is generative, it can be queried in different ways. Marginalizing everything except the words, it can play the role of a language model. In §5.1, we find that it outperforms both a strong n-gram language model and a strong recurrent neural network language model on the English test set of the CoNLL 2012 shared task on coreference evaluation (Pradhan et al., 2012). The model can also identify entity mentions and coreference relationships among them. In §5.2, we show that it can easily be used to add a performance boost to a strong coreference resolution system, by reranking a list of k-best candidate outputs. On the CoNLL 2012 shared task test set, the reranked outputs are significantly better than the original top choices from the same system. Finally, the model can perform entity cloze tasks. As presented in §5.3, it achieves state-of-the-art performance on the InScript corpus (Modi et al., 2017).

Model
A language model defines a distribution over sequences of word tokens; let $X_t$ denote the random variable for the $t$th word in the sequence, $x_t$ denote the value of $X_t$, and $\mathbf{x}_t$ the distributed representation (embedding) of this word. Our starting point for language modeling is a recurrent neural network (Mikolov et al., 2010), which defines
$$\mathbf{h}_t = \mathrm{LSTM}(\mathbf{h}_{t-1}, \mathbf{x}_t) \qquad (1)$$
$$X_t \sim \mathrm{softmax}(\mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{b}) \qquad (2)$$
where $\mathbf{W}_h$ and $\mathbf{b}$ are parameters of the model (along with word embeddings $\mathbf{x}_t$), LSTM is the widely used recurrent function known as "long short-term memory" (Hochreiter and Schmidhuber, 1997), and $\mathbf{h}_t$ is an LSTM hidden state encoding the history of the sequence up to the $t$th word. Great success has been reported for this model (Zaremba et al., 2015), which posits nothing explicitly about the words appearing in the text sequence. Its generative story is simple: the value of each $X_t$ is randomly chosen conditioned on the vector $\mathbf{h}_{t-1}$ encoding its history.

Additional random variables and representations for entities
To introduce our model, we associate with each word an additional set of random variables. At position $t$,
• $R_t$ is a binary random variable that indicates whether $x_t$ belongs to an entity mention ($R_t = 1$) or not ($R_t = 0$). Though not explored here, this is easily generalized to a categorical variable for the type of the entity (e.g., person, organization, etc.).
• $L_t \in \{1, \ldots, \ell_{max}\}$ is a categorical random variable if $R_t = 1$, which indicates the number of remaining words in this mention, including the current word (i.e., $L_t = 1$ for the last word in any mention). $\ell_{max}$ is a predefined maximum length fixed to be 25, which is an empirical value derived from the training corpora used in the experiments. If $R_t = 0$, then $L_t = 1$. We denote the value of $L_t$ by $\ell_t$.
• $E_t \in \mathcal{E}_t$ is the index of the entity referred to, if $R_t = 1$. The set $\mathcal{E}_t$ consists of $\{1, \ldots, 1 + \max_{t' < t} e_{t'}\}$, i.e., the indices of all previously mentioned entities plus an additional value for a new entity. Thus $\mathcal{E}_t$ starts as $\{1\}$ and grows monotonically with $t$, allowing for an arbitrary number of entities to be mentioned. We denote the value of $E_t$ by $e_t$. If $R_t = 0$, then $E_t$ is fixed to a special value ø.
The values of these random variables for our running example are shown in Figure 2. In addition to using symbolic variables to encode mentions and coreference relationships, we maintain a vector representation of each entity that evolves over time. For the $i$th entity, let $\mathbf{e}_{i,t}$ be its representation at time $t$. These vectors are different from word vectors ($\mathbf{x}_t$), in that they are not parameters of the model. They are similar to history representations ($\mathbf{h}_t$), in that they are derived through parameterized functions of the random variables' values, which we will describe in the next subsections.
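To make the three variables concrete, the following minimal sketch derives the values of R, L, and E for every token from a list of mention spans. The sentence, the span format `(start, end_exclusive, entity_index)`, and the function name `annotate` are illustrative assumptions, not part of the released implementation; `None` stands in for the special value ø.

```python
from typing import List, Optional, Tuple

def annotate(
    tokens: List[str], mentions: List[Tuple[int, int, int]]
) -> Tuple[List[int], List[int], List[Optional[int]]]:
    """Derive per-token (R, L, E) values from non-overlapping mention spans.

    Each mention is (start, end_exclusive, entity_index); this span format
    is an assumption made for the example.
    """
    n = len(tokens)
    r = [0] * n                              # R_t: 1 inside a mention, else 0
    length = [1] * n                         # L_t: 1 outside mentions
    e: List[Optional[int]] = [None] * n      # E_t: None plays the role of ø
    for start, end, entity in mentions:
        for t in range(start, end):
            r[t] = 1
            length[t] = end - t              # remaining words, including x_t
            e[t] = entity
    return r, length, e

# Hypothetical example sentence with two entities.
tokens = "john went to the coffee shop . he liked it".split()
mentions = [(0, 1, 1), (3, 6, 2), (7, 8, 1), (9, 10, 2)]
r, l, e = annotate(tokens, mentions)
```

Note how the three-word mention "the coffee shop" receives the lengths 3, 2, 1, so that the last word of every mention has length 1, matching the definition of L above.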

Generative story
The generative story for the word (and other variables) at timestep $t$ is as follows; forward-referenced equations are in the detailed discussion that follows.
• If $r_t = 0$, set $\ell_t = 1$ and $e_t = ø$; then go to step 3. Otherwise:
  - If there is no embedding for the new candidate entity with index $1 + \max_{t' < t} e_{t'}$, create one following §2.4.
3. Sample $x_t$ from the word distribution given the LSTM hidden state $\mathbf{h}_{t-1}$ and the current entity embedding $\mathbf{e}_{current}$.
6. For every entity $e_\iota \in \mathcal{E}_t \setminus \{e_t\}$, set $\mathbf{e}_{\iota,t} = \mathbf{e}_{\iota,t-1}$ (i.e., no changes to other entities' representations).
Note that at any given time step $t$, $\mathbf{e}_{current}$ will always contain the most recent vector representation of the most recently mentioned entity. A generative model with a similar hierarchical structure was used by Haghighi and Klein (2010) for coreference resolution. Our approach differs in two important ways. First, our model defines a joint distribution over all of the text, not just the entity mentions. Second, we use representation learning rather than Bayesian nonparametrics, allowing natural integration with the language model.

Probability distributions
The generative story above referenced several parametric distributions defined based on vector representations of histories and entities. These are defined as follows.
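For $r \in \{0, 1\}$,
$$p(R_t = r \mid \mathbf{h}_{t-1}) = \frac{\exp(\mathbf{h}_{t-1}^\top \mathbf{W}_r \mathbf{r})}{\sum_{r' \in \{0,1\}} \exp(\mathbf{h}_{t-1}^\top \mathbf{W}_r \mathbf{r}')} \qquad (3)$$
where $\mathbf{r}$ is the parameterized embedding associated with $r$, which paves the way for exploring entity type representations in future work; $\mathbf{W}_r$ is a parameter matrix for the bilinear score for $\mathbf{h}_{t-1}$ and $\mathbf{r}$.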
To give the possibility of predicting a new entity, we need an entity embedding beforehand with index $(1 + \max_{t' < t} e_{t'})$, which is randomly sampled from Equation 7. Then, for every $e \in \{1, \ldots, 1 + \max_{t' < t} e_{t'}\}$:
$$p(E_t = e \mid R_t = 1, \mathbf{h}_{t-1}) \propto \exp(\mathbf{h}_{t-1}^\top \mathbf{W}_{entity} \mathbf{e}_{e,t-1} + \mathbf{w}_{dist}^\top \mathbf{f}(e)) \qquad (4)$$
where $\mathbf{e}_{e,t-1}$ is the embedding of entity $e$ at time step $t-1$, $\mathbf{W}_{entity}$ is the weight matrix for predicting entities using their continuous representations, and $\mathbf{w}_{dist}$ is the weight vector for the distance features. The score above is normalized over values $\{1, \ldots, 1 + \max_{t' < t} e_{t'}\}$. $\mathbf{f}(e)$ represents a vector of distance features associated with $e$ and the mentions of the existing entities. Hence two information sources are used to predict the next entity: (i) contextual information $\mathbf{h}_{t-1}$, and (ii) distance features $\mathbf{f}(e)$ from the current mention to the closest mention from each previously mentioned entity. $\mathbf{f}(e) = \mathbf{0}$ if $e$ is a new entity. This term can also be extended to include other surface-form features for coreference resolution (Martschat and Strube, 2015; Clark and Manning, 2016b).
For the chosen entity $e_t$ from Equation 4, the distribution over its mention length is drawn according to
$$p(L_t = \ell \mid \mathbf{h}_{t-1}, \mathbf{e}_{e_t,t-1}) \propto \exp(\mathbf{W}_{length,\ell}^\top [\mathbf{h}_{t-1}; \mathbf{e}_{e_t,t-1}]) \qquad (5)$$
where $\mathbf{e}_{e_t,t-1}$ is the most recent embedding of the entity $e_t$, not updated with $\mathbf{h}_t$. The intuition is that $\mathbf{e}_{e_t,t-1}$ will help contextual information $\mathbf{h}_{t-1}$ to select the residual length of entity $e_t$. $\mathbf{W}_{length}$ is the weight matrix for length prediction, with $\ell_{max} = 25$ rows.
Finally, the probability of a word $x$ as the next token is jointly modeled by $\mathbf{h}_{t-1}$ and the vector representation of the most recently mentioned entity $\mathbf{e}_{current}$:
$$p(X_t = x \mid \mathbf{h}_{t-1}, \mathbf{e}_{current}) = \mathrm{CFSM}(\mathbf{h}_{t-1} + \mathbf{W}_e \mathbf{e}_{current}) \qquad (6)$$
where $\mathbf{W}_e$ is a transformation matrix to adjust the dimensionality of $\mathbf{e}_{current}$. CFSM is a class factorized softmax function (Goodman, 2001; Baltescu and Blunsom, 2015). It uses a two-step prediction with predefined word classes instead of direct prediction on the whole vocabulary, and reduces the time complexity to the log of vocabulary size.
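The entity-selection step described above can be sketched in a few lines of plain Python: each candidate entity, including the new-entity slot, receives a bilinear score against the history vector plus a weighted distance-feature term, and the scores are normalized with a softmax. This is a minimal illustration, not the released DyNet code; the names `bilinear`, `entity_distribution`, and `w_dist` (a weight vector on the distance features) are assumptions, as are all the toy numbers.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]

def bilinear(h: Vector, W: Matrix, v: Vector) -> float:
    """Compute the bilinear score h^T W v."""
    return sum(
        h[i] * sum(W[i][j] * v[j] for j in range(len(v)))
        for i in range(len(h))
    )

def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]

def entity_distribution(
    h_prev: Vector,
    W_entity: Matrix,
    entity_embs: List[Vector],   # last entry: embedding of the new candidate
    w_dist: Vector,              # assumed weights on the distance features
    dist_feats: List[Vector],    # f(e) per candidate; all zeros for the new one
) -> List[float]:
    scores = [
        bilinear(h_prev, W_entity, emb)
        + sum(w * x for w, x in zip(w_dist, f))
        for emb, f in zip(entity_embs, dist_feats)
    ]
    return softmax(scores)

# Toy example: two existing entities plus one new-entity candidate.
h_prev = [1.0, 0.0]
W_entity = [[1.0, 0.0], [0.0, 1.0]]
entity_embs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
dist_feats = [[1.0], [0.0], [0.0]]
probs = entity_distribution(h_prev, W_entity, entity_embs, [0.5], dist_feats)
```

In this toy setting the first entity is both closest to the history vector and favored by the distance feature, so it receives the largest probability; the new-entity candidate competes in the same softmax as the existing entities.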

Dynamic entity representations
Before predicting the entity at step $t$, we need an embedding for the new candidate entity with index $e' = 1 + \max_{t' < t} e_{t'}$ if it does not exist. The new embedding is generated randomly, according to a normal distribution, then projected onto the unit ball:
$$\mathbf{u} \sim \mathcal{N}(\mathbf{r}_1, \sigma^2 \mathbf{I}); \qquad \mathbf{e}_{e',t-1} = \frac{\mathbf{u}}{\lVert \mathbf{u} \rVert_2} \qquad (7)$$
where $\sigma = 0.01$. The time step $t-1$ in $\mathbf{e}_{e',t-1}$ means the current embedding contains no information from step $t$, although it will be updated once we have $\mathbf{h}_t$ and if $E_t = e'$. $\mathbf{r}_1$ is the parameterized embedding for $R_t = 1$, which will be jointly optimized with other parameters and is expected to encode some generic information about entities. All the initial entity embeddings are centered on the mean $\mathbf{r}_1$, which is used in Equation 3 to determine whether the next token belongs to an entity mention. Another choice would be to initialize with a zero vector, although our preliminary experiments showed this did not work as well as random initialization in Equation 7.
Assume $R_t = 1$ and $E_t = e_t$, which means $x_t$ is part of a mention of entity $e_t$. Then, we need to update $\mathbf{e}_{e_t,t-1}$ based on the new information we have from $\mathbf{h}_t$. The new embedding $\mathbf{e}_{e_t,t}$ is a convex combination of the old embedding ($\mathbf{e}_{e_t,t-1}$) and current LSTM hidden state ($\mathbf{h}_t$) with the interpolation ($\delta_t$) determined dynamically based on a bilinear function:
$$\delta_t = \mathrm{sigmoid}(\mathbf{h}_t^\top \mathbf{W}_\delta \mathbf{e}_{e_t,t-1}); \qquad \mathbf{u} = \delta_t \mathbf{e}_{e_t,t-1} + (1 - \delta_t) \mathbf{h}_t; \qquad \mathbf{e}_{e_t,t} = \frac{\mathbf{u}}{\lVert \mathbf{u} \rVert_2} \qquad (8)$$
This updating scheme will be used to update $\mathbf{e}_{e_t}$ in each of the following $\ell_t$ steps. The projection in the last step keeps the magnitude of the entity embedding fixed, avoiding numeric overflow. A similar updating scheme has been used by Henaff et al. (2016) for the "memory blocks" in their recurrent entity network models. The difference is that their model updates all memory blocks in each time step. Instead, our updating scheme in Equation 8 only applies to the selected entity $e_t$ at time step $t$.
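The initialization and update described above can be illustrated with a short pure-Python sketch. It assumes that the interpolation weight is a logistic sigmoid of the bilinear score and that this weight multiplies the old embedding; the names `new_entity`, `update_entity`, and `W_delta` are chosen for the example and are not taken from the released code.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]

def normalize(u: Vector) -> Vector:
    """Rescale u to unit L2 norm (assumes u is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in u))
    return [x / norm for x in u]

def new_entity(r1: Vector, sigma: float = 0.01, rng=random) -> Vector:
    """Sample around the generic entity embedding r1, then normalize."""
    return normalize([rng.gauss(mean, sigma) for mean in r1])

def update_entity(e_old: Vector, h_t: Vector, W_delta: Matrix) -> Tuple[Vector, float]:
    """Interpolate the old embedding with h_t and renormalize.

    Assumption: delta = sigmoid(h_t^T W_delta e_old) weights the old embedding.
    """
    score = sum(
        h_t[i] * sum(W_delta[i][j] * e_old[j] for j in range(len(e_old)))
        for i in range(len(h_t))
    )
    delta = 1.0 / (1.0 + math.exp(-score))
    u = [delta * eo + (1.0 - delta) * ht for eo, ht in zip(e_old, h_t)]
    return normalize(u), delta

# Toy usage with a seeded generator so the run is repeatable.
rng = random.Random(0)
e_new = new_entity([0.5, 0.5, 0.5, 0.5], rng=rng)
e_upd, delta = update_entity([1.0, 0.0], [0.0, 1.0], [[0.0, 0.0], [0.0, 0.0]])
```

With an all-zero `W_delta` the bilinear score is 0, so the interpolation weight is 0.5 and the updated embedding points halfway between the old embedding and the hidden state. Because the combination is elementwise, the sketch requires the entity embedding and the hidden state to have the same dimension.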

Training objective
The model is trained to maximize the log of the joint probability of $R$, $E$, $L$, and $X$:
$$\ell(\theta) = \log P(R, E, L, X; \theta) \qquad (9)$$
where $\theta$ is the collection of all the parameters in this model. Based on the formulation in §2.3, Equation 9 can be decomposed as the sum of conditional log-probabilities of each random variable at each time step. This objective requires the training data annotated as in Figure 2. We do not assume that these variables are observed at test time.
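One way to write out the decomposition mentioned above is given below. It is a sketch under two assumptions: that the variables at each step are drawn in the order mention indicator, entity index, mention length, word; and that the entity and length terms contribute only at positions where a new mention starts, written here with a hypothetical indicator $s_t$ (positions inside a mention that is already in progress have their values determined by the preceding step, and their indicator term is then dropped as well).

```latex
\ell(\theta) = \sum_{t=1}^{T} \Big[
    \log p(r_t \mid \mathbf{h}_{t-1})
  + s_t \big( \log p(e_t \mid r_t = 1, \mathbf{h}_{t-1})
            + \log p(\ell_t \mid \mathbf{h}_{t-1}, \mathbf{e}_{e_t,t-1}) \big)
  + \log p(x_t \mid \mathbf{h}_{t-1}, \mathbf{e}_{current})
\Big]
```

Each term corresponds to one of the local distributions in §2.3, so the gradient of the objective is a sum of gradients of ordinary softmax log-likelihoods.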

Implementation Details
Our model is implemented with DyNet (Neubig et al., 2017) and available at https://github.com/jiyfeng/entitynlm. We use AdaGrad (Duchi et al., 2011) with learning rate λ = 0.1 and Adam (Kingma and Ba, 2014) with default learning rate λ = 0.001 as the candidate optimizers of our model. For all the parameters, we use the initialization tricks recommended by Glorot and Bengio (2010). To avoid overfitting, we also employ dropout (Srivastava et al., 2014) with candidate rates {0.2, 0.5}.
In addition, there are two tunable hyperparameters of ENTITYNLM: the size of word embeddings and the dimension of LSTM hidden states. For both of them, we consider the values {32, 48, 64, 128, 256}. We also experiment with the option to either use the pretrained GloVe word embeddings (Pennington et al., 2014) or randomly initialized word embeddings (then updated during training). For all experiments, the best configuration of hyperparameters and optimizers is selected based on the objective value on the development data.
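The search space described above can be enumerated as follows. This is a sketch only: it assumes the full cross product of all options is searched, which the text does not state, and `dev_objective` is a hypothetical callable that returns the development-set objective (higher is better) for one configuration.

```python
from itertools import product
from typing import Callable, Dict, Iterator, List

# Candidate values as listed in the text; the dictionary keys are made up.
GRID: Dict[str, List] = {
    "optimizer": [("adagrad", 0.1), ("adam", 0.001)],  # (name, learning rate)
    "dropout": [0.2, 0.5],
    "word_dim": [32, 48, 64, 128, 256],
    "hidden_dim": [32, 48, 64, 128, 256],
    "pretrained_glove": [True, False],
}

def configurations(grid: Dict[str, List]) -> Iterator[dict]:
    """Yield every combination of the candidate values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def select_best(grid: Dict[str, List], dev_objective: Callable[[dict], float]) -> dict:
    """Return the configuration with the highest development objective."""
    return max(configurations(grid), key=dev_objective)

n_configs = sum(1 for _ in configurations(GRID))
# Toy objective for illustration only: prefers large hidden states and low dropout.
best = select_best(GRID, lambda c: c["hidden_dim"] - 100 * c["dropout"])
```

Under the cross-product assumption there are 2 × 2 × 5 × 5 × 2 = 200 configurations per task.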

Evaluation Tasks and Datasets
We evaluate our model in diverse use scenarios: (i) language modeling, (ii) coreference resolution, and (iii) entity prediction. The evaluation on language modeling shows how the internal entity representation, when marginalized out, can improve the perplexity of language models. The coreference resolution experiment shows how our new language model can improve a competitive coreference resolution system. Finally, we employ an entity cloze task to demonstrate the generative performance of our model in predicting the next entity given the previous context.
We use two datasets for the three evaluation tasks. For language modeling and coreference resolution, we use the English benchmark data from the CoNLL 2012 shared task on coreference resolution (Pradhan et al., 2012). We employ the standard training/development/test split, which includes 2,802/343/348 documents with roughly 1M/150K/150K tokens, respectively. We follow the coreference annotation in the CoNLL dataset to extract entities and ignore the singleton mentions in texts.
For entity prediction, we employ the InScript corpus created by Modi et al. (2017). It consists of 10 scenarios, including grocery shopping, taking a flight, etc. It includes 910 crowdsourced simple narrative texts in total, of which 18 stories were ignored due to labeling problems (Modi et al., 2017). On average, each story has 12.4 sentences, 24.9 entities and 217.2 tokens. Each entity mention is labeled with its entity index. We use the same training/development/test split as Modi et al. (2017), which includes 619, 91, and 182 texts, respectively.

Data preprocessing
For the CoNLL dataset, we lowercase all tokens and remove any token that only contains a punctuation symbol unless it is in an entity mention. We also replace numbers in the documents with the special token NUM and low-frequency word types with UNK. The vocabulary size of the CoNLL data after preprocessing is 10K. For entity mention extraction, in the CoNLL dataset, one entity mention could be embedded in another. For embedded mentions, only the enclosing entity mention is kept. We use the same preprocessed data for both language modeling and coreference resolution evaluation.
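The token-level steps described above could be sketched as follows. The number pattern, the frequency threshold `min_count`, and the function names are assumptions made for the example; the text does not specify how numbers are detected or what counts as a low-frequency word type.

```python
import re
import string
from collections import Counter
from typing import List

# Assumed number pattern: optional sign, digits with optional commas/decimals.
NUM_RE = re.compile(r"^[+-]?\d[\d,]*(\.\d+)?$")
PUNCT = set(string.punctuation)

def clean(tokens: List[str], in_mention: List[bool]) -> List[str]:
    """Lowercase, drop punctuation-only tokens outside mentions, map numbers to NUM."""
    out = []
    for tok, inside in zip(tokens, in_mention):
        tok = tok.lower()
        if not inside and tok and all(ch in PUNCT for ch in tok):
            continue  # punctuation-only token that is not part of a mention
        out.append("NUM" if NUM_RE.match(tok) else tok)
    return out

def replace_rare(docs: List[List[str]], min_count: int = 2) -> List[List[str]]:
    """Replace word types seen fewer than min_count times with UNK."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [
        [tok if tok == "NUM" or counts[tok] >= min_count else "UNK" for tok in doc]
        for doc in docs
    ]

cleaned = clean(
    ["John", "paid", "3.50", ",", "He", "left", "."],
    [True, False, False, False, True, False, False],
)
with_unk = replace_rare([["a", "b", "a"], ["a", "c"]])
```

Because `clean` removes tokens, any mention spans stored as token offsets would need to be re-aligned afterwards; that bookkeeping is omitted here.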
For the InScript corpus, we apply similar data preprocessing to lowercase all tokens, and we replace low-frequency word types with UNK. The vocabulary size after preprocessing is 1K.

Experiments
In this section, we present the experimental results on the three evaluation tasks.

Language modeling
Task description. The goal of language modeling is to compute the marginal probability:
$$P(X) = \sum_{R, E, L} P(X, R, E, L) \qquad (10)$$
However, due to the long-range dependency in recurrent neural networks, the search space of $R, E, L$ during inference grows exponentially. We thus use importance sampling to approximate the marginal distribution of $X$. Specifically, with the samples from a proposal distribution $Q(R, E, L \mid X)$, the approximated marginal probability is defined as
$$P(X) \approx \frac{1}{N} \sum_{i=1}^{N} \frac{P(r^{(i)}, e^{(i)}, \ell^{(i)}, x)}{Q(r^{(i)}, e^{(i)}, \ell^{(i)} \mid x)} \qquad (11)$$
where $(r^{(i)}, e^{(i)}, \ell^{(i)})$ is the $i$th of $N$ samples drawn from $Q$. A similar idea of using importance sampling for language modeling evaluation has been used in prior work. For language modeling evaluation, we train our model on the training set from the CoNLL 2012 dataset with coreference annotation. On the test data, we treat coreference structure as latent variables and use importance sampling to approximate the marginal distribution of $X$. For each document, the model randomly draws $N = 100$ samples from the proposal distribution, discussed next.
Proposal distribution. For implementation of $Q$, we use a discriminative variant of ENTITYNLM by taking the current word $x_t$ for predicting the entity-related variables in the same time step. Specifically, in the generative story described in §2.2, we delete step 3 (words are not generated, but rather conditioned upon), move step 4 before step 1, and replace $\mathbf{h}_{t-1}$ with $\mathbf{h}_t$ in the steps for predicting entity type $R_t$, entity $E_t$ and mention length $L_t$. This model variant provides a conditional probability $Q(R_t, E_t, L_t \mid X_t)$ at each timestep.
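The importance-sampling estimate of the marginal can be computed stably in log space. The sketch below assumes that, for each sampled assignment of the entity variables, we already have the joint log-probability under the model and the log-probability under the proposal, both as natural logarithms; the function names are illustrative.

```python
import math
from typing import List

def log_mean_exp(xs: List[float]) -> float:
    """Compute log(mean(exp(x))) without overflow or underflow."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs) / len(xs))

def log_marginal(log_joint: List[float], log_proposal: List[float]) -> float:
    """Importance-sampling estimate of log P(X).

    log_joint[i]    = log P(r_i, e_i, l_i, x) for the i-th sample
    log_proposal[i] = log Q(r_i, e_i, l_i | x) for the same sample
    """
    log_weights = [p - q for p, q in zip(log_joint, log_proposal)]
    return log_mean_exp(log_weights)

def perplexity(doc_log_marginals: List[float], n_tokens: int) -> float:
    """Perplexity from summed natural-log marginals over a corpus."""
    return math.exp(-sum(doc_log_marginals) / n_tokens)

# Toy check: weights 0.2/0.5 = 0.4 and 0.1/0.5 = 0.2 average to 0.3.
estimate = log_marginal(
    [math.log(0.2), math.log(0.1)],
    [math.log(0.5), math.log(0.5)],
)
ppl = perplexity([-7.0], 7)
```

Working with log-weights matters in practice: document-level joint probabilities are far too small to represent directly, so the ratio of joint to proposal is formed as a difference of logs before averaging.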
Baselines. We compare the language modeling performance with two competitive baselines: a 5-gram language model implemented in KenLM (Heafield et al., 2013) and an RNNLM with LSTM units implemented in DyNet (Neubig et al., 2017). For the RNNLM, we use the same hyperparameters described in §3 and grid search on the development data to find the best configuration.
Results. The results of ENTITYNLM and the baselines on both development and test data are reported in Table 1. For ENTITYNLM, we use the value of $2^{-\frac{1}{T} \sum_{t=1}^{T} \log P(X_t, R_t, E_t, L_t)}$ on the development set with coreference annotation to select the best model configuration and report the best number. On the test data, we are able to calculate perplexity by marginalizing all other random variables using Equation 11. To compute the perplexity numbers on the test data, our model only takes account of log probabilities on word prediction. The difference is that coreference information is only used for training ENTITYNLM and not for test. All three models reported in Table 1 share the same vocabulary, so the numbers on the test data are directly comparable. As shown in Table 1, ENTITYNLM outperforms both the 5-gram language model and the RNNLM on the test data. Better performance of ENTITYNLM on language modeling can be expected, if we also use the marginalization method defined in Equation 11 on the development data to select the best configuration. However, we plan to use the same experimental setup for all experiments, instead of customizing our model for each individual task.

Coreference reranking
Task description. We show how ENTITYNLM, which allows an efficient computation of the probability $P(R, E, L, X)$, can be used as a coreference reranker to improve a competitive coreference resolution system due to Martschat and Strube (2015). This task is analogous to the reranking approach used in machine translation (Shen et al., 2004). The specific formulation is as follows:
$$\arg\max_{(r^{(i)}, e^{(i)}, \ell^{(i)}) \in \mathcal{K}} P(r^{(i)}, e^{(i)}, \ell^{(i)}, x) \qquad (12)$$
where $\mathcal{K}$ is the k-best list for a given document.
In our experiments, k = 100. To the best of our knowledge, the problem of obtaining k-best outputs of a coreference resolution system has not been studied before.
Approximate k-best decoding. We rerank the output of a system that predicts an antecedent for each mention by relying on pairwise scores for mention pairs. This is the dominant approach for coreference resolution (Martschat and Strube, 2015; Clark and Manning, 2016a). The predictions induce an antecedent tree, which represents antecedent decisions for all mentions in the document. Coreference chains are obtained by transitive closure over the antecedent decisions encoded in the tree. A mention also can have an empty mention as antecedent, which denotes that the mention is non-anaphoric. For extending Martschat and Strube's greedy decoding approach to k-best inference, we cannot simply take the k highest scoring trees according to the sum of edge scores, because different trees may represent the same coreference chain. Instead, we use a heuristic that creates an approximate k-best list on candidate antecedent trees. The idea is to generate trees from the original system output by considering suboptimal antecedent choices that lead to different coreference chains. For each mention pair $(m_j, m_i)$, we compute the difference of its score to the score of the optimal antecedent choice for $m_j$. We then sort pairs in ascending order according to this difference and iterate through the list of pairs. For each pair $(m_j, m_i)$, we create a tree $t_{j,i}$ by replacing the antecedent of $m_j$ in the original system output with $m_i$. If this yields a tree that encodes different coreference chains from all chains encoded by trees in the k-best list, we add $t_{j,i}$ to the k-best list. In the case that we cannot generate a given number of trees (particularly for a short document with a large k), we pad the list with the last item added to the list.
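The heuristic above can be sketched in plain Python. The data layout is an assumption made for illustration: `scores[j]` maps each candidate antecedent index of mention `j` to its pairwise score, with `-1` standing for the empty antecedent, and a tree is a list giving the chosen antecedent of every mention. This is not CORT's actual interface.

```python
from typing import Dict, FrozenSet, List

def chains(antecedents: List[int]) -> FrozenSet[FrozenSet[int]]:
    """Coreference chains as the transitive closure of antecedent links."""
    parent = list(range(len(antecedents)))

    def find(a: int) -> int:
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    for j, i in enumerate(antecedents):
        if i >= 0:                          # -1 means no antecedent
            parent[find(j)] = find(i)
    groups: Dict[int, set] = {}
    for m in range(len(antecedents)):
        groups.setdefault(find(m), set()).add(m)
    return frozenset(frozenset(g) for g in groups.values())

def approx_k_best(scores: List[Dict[int, float]], k: int) -> List[List[int]]:
    """Approximate k-best antecedent trees with pairwise-distinct chains."""
    best = [max(s, key=s.get) for s in scores]      # original system output
    candidates = sorted(
        (scores[j][best[j]] - sc, j, i)             # gap to the optimal choice
        for j, s in enumerate(scores)
        for i, sc in s.items()
        if i != best[j]
    )
    k_best, seen = [best], {chains(best)}
    for _, j, i in candidates:
        if len(k_best) >= k:
            break
        tree = list(best)
        tree[j] = i                                 # swap a single antecedent
        c = chains(tree)
        if c not in seen:                           # keep only new chain sets
            seen.add(c)
            k_best.append(tree)
    while len(k_best) < k:                          # pad with the last item
        k_best.append(list(k_best[-1]))
    return k_best

# Toy document with three mentions.
scores = [{-1: 0.0}, {-1: 0.2, 0: 1.0}, {-1: 0.1, 0: 0.9, 1: 1.0}]
trees = approx_k_best(scores, 5)
```

In the toy document, re-attaching the third mention to the first mention instead of the second yields the same single chain as the original output, so that tree is skipped even though it has the smallest score gap. This is exactly the situation that rules out taking the k highest-scoring trees directly.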
Evaluation measures. For coreference resolution evaluation, we employ the CoNLL scorer (Pradhan et al., 2014). It computes three commonly used evaluation measures: MUC (Vilain et al., 1995), $B^3$ (Bagga and Baldwin, 1998), and $\mathrm{CEAF}_e$ (Luo, 2005). We report the $F_1$ score of each evaluation measure and their average as the CoNLL score.
Competing systems. We employed CORT (Martschat and Strube, 2015) as our baseline coreference resolution system. Here, we compare with the original (one best) outputs of CORT's latent ranking model, which is the best-performing model implemented in CORT. We consider two rerankers based on ENTITYNLM. The first reranking method only uses the log probability from ENTITYNLM to sort the candidate list (Equation 12). The second method uses a linear combination of both log probabilities from ENTITYNLM and the scores from CORT, where the coefficients were found via grid search with the CoNLL score on the development set.
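Both rerankers reduce to picking the candidate with the highest (possibly weighted) score from the k-best list. The following sketch makes that explicit; the `Candidate` record, the weight names, and all numbers are hypothetical.

```python
from typing import NamedTuple, Sequence

class Candidate(NamedTuple):
    name: str            # identifier of one k-best coreference output
    lm_logprob: float    # log P(R, E, L, X) under the entity language model
    base_score: float    # score assigned by the baseline coreference system

def rerank(
    candidates: Sequence[Candidate],
    lm_weight: float = 1.0,
    base_weight: float = 0.0,
) -> Candidate:
    """Return the candidate maximizing the weighted combination of scores.

    With base_weight = 0 this is reranking by language-model log probability
    alone; nonzero weights give the linear-combination variant.
    """
    return max(
        candidates,
        key=lambda c: lm_weight * c.lm_logprob + base_weight * c.base_score,
    )

k_best = [
    Candidate("a", -120.0, 5.0),
    Candidate("b", -118.0, 3.0),
    Candidate("c", -119.0, 4.9),
]
lm_only = rerank(k_best)
combined = rerank(k_best, lm_weight=1.0, base_weight=2.0)
```

In the linear-combination variant, the two weights would be chosen by trying a grid of values and keeping the pair that gives the best CoNLL score on the development set.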
Results. The reranked results on the CoNLL 2012 test set are reported in Table 2. The numbers of the baseline are higher than the results reported in Martschat and Strube (2015) since the feature set of CORT was subsequently extended. Lines 2 and 3 in Table 2 present the reranked best results. As shown in this table, both reranked results give more than 1% of CoNLL score improvement on the test set over CORT, which is significant based on an approximate randomization test.
Additional experiments also found that increasing k from 100 to 500 had a minor effect. That is because the diversity of each k-best list is limited by (i) the number of entity mentions in the document, (ii) the performance of the baseline coreference resolution system, and possibly (iii) the approximate nature of our k-best inference procedure. We suspect that a stronger baseline system (such as that of Clark and Manning, 2016a) could give greater improvements, if it can be adapted to provide k-best lists. Future work might incorporate the techniques embedded in such systems into ENTITYNLM.
Figure 3: A short story on bicycles from the InScript corpus (Modi et al., 2017). The entity prediction task requires predicting xxxx given the preceding text either by choosing a previously mentioned entity or deciding that this is a "new entity". In this example, the ground-truth prediction is [tire]_4. For training, ENTITYNLM attempts to predict every entity. For testing, it predicts a maximum of 30 entities after the first three sentences, which is consistent with the experimental setup suggested by Modi et al. (2017).

Entity prediction
Task description. Based on Modi et al. (2017), we introduce a novel entity prediction task that tries to predict the next entity given the preceding text. For a given text as in Figure 3, this task makes a forward prediction based on only the left context. This is different from coreference resolution, where both left and right contexts from a given entity mention are used in decoding. It is also different from language modeling, since this task only requires predicting entities. Since ENTITYNLM is generative, it can be directly applied to this task. To predict entities in test data, $R_t$ is always given and ENTITYNLM only needs to predict $E_t$ when $R_t = 1$.
Baselines and human prediction. We introduce two baselines in this task: (i) the always-new baseline that always predicts "new entity"; (ii) a linear classification model using shallow features from Modi et al. (2017), including the recency of an entity's last mention and the frequency. We also compare with the model proposed by Modi et al. (2017). Their work assumes that the model has prior knowledge of all the participant types, which are specific to each scenario and fine-grained, e.g., rider in the bicycle narrative, and predicts participant types for new entities. This assumption is unrealistic for pure generative models like ours. Therefore, we remove this assumption and adapt their prediction results to our formulation by mapping all the predicted entities that have not been mentioned to "new entity". We also compare to the adapted human prediction used in the InScript corpus. For each entity slot, Modi et al. (2017) acquired 20 human predictions, and the majority vote was selected. More details about human predictions are discussed by Modi et al. (2017).
Results. Table 3 shows the prediction accuracies. ENTITYNLM (line 4) significantly outperforms both baselines (lines 1 and 2) and prior work (line 3) (p < 0.01, paired t-test). The comparison between lines 4 and 5 shows that our model is close to human prediction performance.

Related Work
Rich-context language models. The originally proposed recurrent neural network language models only capture information within sentences. To extend the capacity of RNNLMs, various researchers have incorporated information beyond sentence boundaries. Previous work focuses on contextual information from previous sentences (Ji et al., 2016a) or discourse relations between adjacent sentences (Ji et al., 2016b), showing improvements to language modeling and related tasks like coherence evaluation and discourse relation prediction. In this work, ENTITYNLM adds explicit entity information to the language model, which is another way of adding a memory network for language modeling. Unlike the work by Tran et al. (2016), where memory blocks are used to store general contextual information for language modeling, ENTITYNLM assigns each memory block specifically to only one entity.
Entity-related models. Two recent approaches to modeling entities in text are closely related to our model. The first is the "reference-aware" language models proposed by Yang et al. (2016), where the referred entities are from either a predefined item list, an external database, or the context from the same document. Yang et al. (2016) present three models, one for each case. For modeling a document with entities, they use coreference links to recover entity clusters, though they only model entity mentions as containing a single word (an inappropriate assumption, in our view). Their entity updating method takes the latest hidden state (similar to $\mathbf{h}_t$ when $R_t = 1$ in our model) as the new representation of the current entity; no long-term history of the entity is maintained, just the current local context. In addition, their language model evaluation assumes that entity information is provided at test time (Yang, personal communication), which makes a direct comparison with our model impossible. Our entity updating scheme is similar to the "dynamic memory" method used by Henaff et al. (2016). Our entity representations are dynamically allocated and updated only when an entity appears, while the EntNet from Henaff et al. (2016) does not model entities and their relationships explicitly. In their model, entity memory blocks are pre-allocated and updated simultaneously in each timestep. So there is no dedicated memory block for every entity and no distinction between entity mentions and non-mention words. As a consequence, it is not clear how to use their model for coreference reranking and entity prediction.
Coreference resolution. The hierarchical structure of our entity generation model is inspired by Haghighi and Klein (2010). They implemented this idea as a probabilistic graphical model with the distance-dependent Chinese Restaurant Process (Pitman, 1995) for entity assignment, while our model is built on a recurrent neural network architecture. The reranking method considered in our coreference resolution evaluation could also be extended with samples from additional coreference resolution systems, to produce more variety (Ng, 2005). The benefit of such a system comes, we believe, from the explicit tracking of each entity throughout the text, providing entity-specific representations. In previous work, such information has been added as features (Luo et al., 2004; Björkelund and Kuhn, 2014) or by computing distributed entity representations (Wiseman et al., 2016; Clark and Manning, 2016b). Our approach complements these previous methods.
Entity prediction. The entity prediction task discussed in §5.3 is based on work by Modi et al. (2017). The main difference is that we do not assume that all entities belong to a previously known set of entity types specified for each narrative scenario. This task is also closely related to the "narrative cloze" task of Chambers and Jurafsky (2008) and the "story cloze test" of Mostafazadeh et al. (2016). Those studies aim to understand relationships between events, while our task focuses on predicting upcoming entity mentions.

Conclusion
We have presented a neural language model, ENTITYNLM, that defines a distribution over texts and the mentioned entities. It provides vector representations for the entities and updates them dynamically in context. The dynamic representations are further used to help generate specific entity mentions and the following text. This model outperforms strong baselines and prior work on three tasks: language modeling, coreference resolution, and entity prediction.