The Referential Reader: A Recurrent Entity Network for Anaphora Resolution

We present a new architecture for storing and accessing entity mentions during online text processing. While reading the text, entity references are identified, and may be stored by either updating or overwriting a cell in a fixed-length memory. The update operation implies coreference with the other mentions that are stored in the same cell; the overwrite operation causes these mentions to be forgotten. By encoding the memory operations as differentiable gates, it is possible to train the model end-to-end, using both a supervised anaphora resolution objective and a supplementary language modeling objective. Evaluation on a dataset of pronoun-name anaphora demonstrates strong performance with purely incremental text processing.


Introduction
Reference resolution is a fundamental problem in language understanding. Many systems employ the mention-pair model, in which a classifier is applied to all pairs of spans (e.g., Lee et al., 2017). While this model performs well in some settings, it is expensive in both computation and labeled data. Furthermore, a new challenge set of pronoun-name references shows that the best supervised systems are outperformed by a simple baseline that links pronouns to names with the same syntactic function, e.g., subject or object (Webster et al., 2018). The mention-pair model is also cognitively implausible: human readers must interpret text in a nearly online fashion.
In this paper, we present a new method for reference resolution, which reads the text left-to-right while storing entities in a fixed-size working memory (Figure 1). As each token is encountered, the reader must decide whether to: (a) link the token to an existing memory, thereby creating a coreference link; (b) overwrite an existing memory and store a new entity; or (c) disregard the token and move ahead. As memories are reused, they receive increasing salience, making them less likely to be overwritten.
* Work carried out as an intern at Facebook AI Research.
This online model for coreference resolution is based on the memory network architecture (Weston et al., 2015), in which memory operations are differentiable, enabling end-to-end training from gold anaphora resolution data. Furthermore, the memory can be combined with a recurrent hidden state, enabling prediction of the next token, and a corresponding language modeling objective. An evaluation on the GAP dataset of pronoun-name anaphora resolution yields promising results. To summarize the contributions of this paper:
• We present a generative model for coreference resolution which can be trained on labeled and unlabeled data.
• Unlike mention-pair systems, inference complexity is linear in the size of the input.

Figure 2: Overview of the model architecture.

Model
For a given document consisting of a sequence of tokens $\{w_t\}_{t=1}^{T}$, we represent text at two levels:
• tokens: represented as $\{x_t\}_{t=1}^{T}$, where the vector $x_t \in \mathbb{R}^{D_x}$ is computed from any token-level encoder, and $T$ is the total number of tokens.
• entities: represented by a fixed-length memory $M_t = \{(k_t^{(i)}, v_t^{(i)}, s_t^{(i)})\}_{i=1}^{N}$, where each memory is a tuple of a key $k_t^{(i)} \in \mathbb{R}^{D_k}$, a value $v_t^{(i)} \in \mathbb{R}^{D_v}$, and a salience $s_t^{(i)} \in [0, 1]$.
There are two components to the model: the memory unit, which stores and tracks the states of the entities in the text; and the recurrent unit, which controls the memory via a set of gates. An overview is presented in Figure 2.

Recurrent Unit
The recurrent unit is inspired by the Coreferential-GRU, in which the current hidden state of a gated recurrent unit (GRU; Chung et al., 2014) is combined with the state at the time of the most recent mention of the current entity (Dhingra et al., 2018). In this work, however, instead of relying on the coreferential structure of a sentence to construct the computational graph, we use an external memory unit to keep track of previously mentioned entities, and let the model learn to decide dynamically what to store in that memory.
The representation at time $t$ is computed from a combination of the memory and the pre-recurrent state $\tilde{h}_t$. The memory state is summarized by the weighted sum over values, $m_t = \sum_{i=1}^{N} u_t^{(i)} v_{t-1}^{(i)}$. An update candidate is then computed from a combination of the memory state $m_t$ and the previous hidden state $h_{t-1}$, controlled by a coreferential gate $c_t$ that determines the amount of update from memory: $c_t = \sigma(w_c \cdot [m_t; h_{t-1}] + b_c)$ and $\check{h}_t = c_t m_t + (1 - c_t) h_{t-1}$, where $w_c$ and $b_c$ are learned parameters.
Finally, the hidden state is computed as $h_t = \mathrm{GRU}(x_t, \check{h}_t)$, where $x_t$ is the embedding of token $t$.
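The recurrent update above can be sketched in a few lines of NumPy. This is our illustration, not the authors' released code: the parameter shapes, the scalar coreferential gate, and the explicit GRU cell are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recurrent_step(x, h_prev, values, u, w_c, b_c, Wz, Wr, Wh):
    """One step of the sketched recurrence.

    x: (Dx,) token embedding; h_prev: (Dh,) previous hidden state;
    values: (N, Dh) memory values; u: (N,) update gates.
    """
    m = u @ values                                         # weighted sum over values
    c = sigmoid(w_c @ np.concatenate([m, h_prev]) + b_c)   # coreferential gate
    h_check = c * m + (1.0 - c) * h_prev                   # update candidate
    # Standard GRU cell applied to the token embedding and the candidate state.
    xh = np.concatenate([x, h_check])
    z = sigmoid(Wz @ xh)                                   # GRU update gate
    r = sigmoid(Wr @ xh)                                   # GRU reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h_check]))
    return (1.0 - z) * h_check + z * h_tilde
```

The candidate state $\check{h}_t$ plays the role that $h_{t-1}$ plays in an ordinary GRU, which is how memory contents can flow into the recurrence.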

Memory Unit
Pre-recurrent update. The memory gates are controlled by a vector $\tilde{h}_t$, which combines the input $x_t$ with the previous hidden state $h_{t-1}$. Because this vector is computed before applying the GRU recurrence, we call it the pre-recurrent state, and define it as $\tilde{h}_t = \tanh(W_{\tilde{h}} [x_t; h_{t-1}] + b_{\tilde{h}})$.
Detecting entity mentions. The memory gates are a collection of scalars $\{(u_t^{(i)}, o_t^{(i)})\}_{i=1}^{N}$, corresponding to update and overwrite operations on each memory cell $i$. To compute these gates, we first determine whether the token $w_t$ is an entity mention. This decision is controlled by a sigmoid activation, $e_t = \sigma(\phi_e \cdot \tilde{h}_t)$, where $\phi_e \in \mathbb{R}^{D_h}$ is a learnable vector. Next, we must decide whether $w_t$ refers to a previously mentioned entity, $r_t = e_t \, \sigma(\phi_r \cdot \tilde{h}_t)$.
Updating existing entities. If $w_t$ is a referential entity mention ($r_t \approx 1$), it may refer to an entity in the memory. To compute the compatibility between $w_t$ and each memory, we first summarize the current state as a query vector, $q_t = f_q(\tilde{h}_t)$, where $f_q$ is a two-layer feed-forward network.
The query vector is then combined with the memory keys to obtain attention scores, $\alpha_t^{(i)} = \frac{\exp(q_t \cdot k_{t-1}^{(i)})}{\exp(b) + \sum_j \exp(q_t \cdot k_{t-1}^{(j)})}$, where the softmax is computed over all cells $i$, and $b$ is a learnable bias term, inversely proportional to the likelihood of introducing a new entity. The gate $r_t \in [0, 1]$ determines whether $w_t$ is referential, as defined above.
The update gate is then set equal to the query match $\alpha_t^{(i)}$, gated by the salience: $u_t^{(i)} = r_t \, \alpha_t^{(i)} \min(1, 2 s_{t-1}^{(i)})$. The $\min$ term ensures that an update can at most triple the salience of a memory.
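The gate computations can be sketched as follows; the function signature, the placement of the new-entity bias $b$ in the softmax partition, and the callable passed in for $f_q$ are our assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mention_gates(h_pre, keys, salience, phi_e, phi_r, f_q, b):
    """h_pre: (Dh,) pre-recurrent state; keys: (N, Dk); salience: (N,)."""
    e = sigmoid(phi_e @ h_pre)            # entity-mention gate e_t
    r = e * sigmoid(phi_r @ h_pre)        # referential gate r_t
    q = f_q(h_pre)                        # query vector q_t
    scores = keys @ q                     # (N,) compatibility with each cell
    # Softmax over cells, with the learned new-entity bias b in the partition,
    # shifted by the max score for numerical stability.
    shift = max(scores.max(), b)
    exp_s = np.exp(scores - shift)
    alpha = exp_s / (exp_s.sum() + np.exp(b - shift))
    # Query match, capped by salience so an update can at most triple it.
    u = r * alpha * np.minimum(1.0, 2.0 * salience)
    return e, r, u
```

Because $\exp(b)$ sits in the partition function, the attention weights sum to less than one, leaving probability mass for introducing a new entity.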
Storing new entities. Overwrite operations are used to store new entities. The total amount to overwrite is $\tilde{o}_t = e_t - \sum_{i=1}^{N} u_t^{(i)}$, which is the difference between the entity gate and the sum of the update gates. We prefer to overwrite the memory with the lowest salience.
This decision is made differentiable using the Gumbel-softmax distribution (GSM; Jang et al., 2017), $o_t^{(i)} = \tilde{o}_t \times \mathrm{GSM}(-s_{t-1}, \tau)^{(i)}$.¹
Memory salience. To the extent that each memory is not updated or overwritten, it is copied along to the next timestep. The weight of this copy operation is $\lambda_t^{(i)} = 1 - u_t^{(i)} - o_t^{(i)}$. The salience is updated by exponential decay, $s_t^{(i)} = u_t^{(i)} + o_t^{(i)} + \lambda_t^{(i)} \left( e_t \gamma_e + (1 - e_t) \gamma_n \right) s_{t-1}^{(i)}$, where $\gamma_e$ and $\gamma_n$ represent the salience decay rate upon seeing an entity or non-entity.²
Memory state. To update the memory states, we first transform the pre-recurrent state $\tilde{h}_t$ into the memory domain, obtaining the memory key/value update candidates $\tilde{k}_t = f_k(\tilde{h}_t)$ and $\tilde{v}_t = f_v(\tilde{h}_t)$, where $f_k$ is a two-layer residual network with tanh nonlinearities, and $f_v$ is a linear projection with a tanh nonlinearity. The complete memory updates are therefore $k_t^{(i)} = o_t^{(i)} \tilde{k}_t + u_t^{(i)} g_k(k_{t-1}^{(i)}, \tilde{k}_t) + \lambda_t^{(i)} k_{t-1}^{(i)}$, and analogously for $v_t^{(i)}$, where $g_k$ ($g_v$) is a recurrent GRU update on the key (value) of the memory cell.
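The overwrite targeting and salience decay might be sketched as follows; this is a sampled NumPy approximation under our own parameterization, not the exact released computation.

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Sample from the Gumbel-softmax relaxation of a categorical."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel noise
    z = (logits + g) / tau
    z = z - z.max()                                       # numerical stability
    e = np.exp(z)
    return e / e.sum()

def salience_step(salience, u, o_total, e, tau, rng,
                  gamma_e=0.5 ** (1 / 4), gamma_n=0.5 ** (1 / 30)):
    """Distribute the overwrite mass o_total and apply the salience decay."""
    # Prefer to overwrite the cell with the lowest salience.
    o = o_total * gumbel_softmax(-salience, tau, rng)
    copy = 1.0 - u - o                        # copy-through weight per cell
    gamma = e * gamma_e + (1.0 - e) * gamma_n # decay for this token
    new_salience = u + o + copy * gamma * salience
    return o, new_salience
```

The default decay constants follow the half-life parameterization in footnote 2 (entity half-life 4, non-entity half-life 30).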

Coreference Chains
To compute the probability of coreference between the mentions $w_{t_1}$ and $w_{t_2}$, we first compute the probability that each cell $i$ refers to the same entity at both of those times, $z_{t_1, t_2}^{(i)} = \prod_{t=t_1+1}^{t_2} (1 - o_t^{(i)})$, where $o_t^{(i)}$ is the overwrite gate for token $t$ into memory $i$.
Furthermore, the probability that mention $t_1$ is stored in memory $i$ is $u_{t_1}^{(i)}$. Therefore, the log probability that two mentions corefer is, $\psi_{t_1, t_2} = \log \sum_{i=1}^{N} u_{t_1}^{(i)} \, z_{t_1, t_2}^{(i)} \, u_{t_2}^{(i)}$. (5)
¹ Here $\tau$ is the "temperature" of the distribution, which is gradually decreased over the training period, until the distribution approaches a one-hot vector indicating the argmax.
² We set $\gamma_e = \exp(\log(0.5)/e)$ with $e = 4$ denoting the entity half-life, which is the number of entity mentions before the salience decreases by half. The non-entity half-life $\gamma_n$ is computed analogously, with $n = 30$.
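The coreference probability above can be computed directly from the gate activations; a minimal NumPy sketch (the array-shape conventions are ours):

```python
import numpy as np

def coref_log_prob(u1, u2, o_between):
    """Log probability that the mentions at t1 and t2 corefer.

    u1, u2: (N,) update gates at times t1 and t2.
    o_between: (T, N) overwrite gates for every step t1 < t <= t2.
    """
    # Probability that each cell survives un-overwritten between the mentions.
    survive = np.prod(1.0 - o_between, axis=0)
    # Sum over cells of: stored at t1, survived, and matched again at t2.
    return np.log(u1 @ (survive * u2))
```

Any overwrite between the two mentions drives the survival term toward zero, breaking the coreference chain through that cell.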

Training
The coreference probability defined in Equation 5 is a differentiable function of the gates, which in turn are computed from the inputs $w_1, w_2, \ldots, w_T$. We can therefore train the entire network end-to-end from a cross-entropy objective, where a loss is incurred for incorrect decisions on the level of token pairs. Specifically, we set $y_{i,j} = 1$ when $w_i$ and $w_j$ corefer, and also when both $w_i$ and $w_j$ are part of the same mention span. The coreference loss is then the cross-entropy, $\ell = -\sum_{i < j} \left[ y_{i,j} \, \psi_{i,j} + (1 - y_{i,j}) \log(1 - e^{\psi_{i,j}}) \right]$. Since the hidden state $h_t$ is computed recurrently from $w_{1:t}$, the reader can also be trained from a language modeling objective. Word probabilities are computed by projecting the hidden state $h_t$ by a matrix of output embeddings and applying the softmax operation. This objective can be used even when coreference annotations are unavailable.
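The pairwise cross-entropy can be sketched as follows; the $\psi$ values are log-probabilities, and the small epsilon clamp is our addition for numerical safety.

```python
import numpy as np

def coref_loss(psi, y, eps=1e-9):
    """Pairwise cross-entropy; psi[k] = log p(pair k corefers), y[k] in {0, 1}."""
    p = np.exp(psi)
    # Positive pairs are penalized by -psi; negative pairs by -log(1 - p).
    return float(-np.sum(y * psi + (1.0 - y) * np.log(1.0 - p + eps)))
```
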

Experiments
As a preliminary evaluation of the ability of the referential reader to correctly track entity references in text, we evaluate against the GAP dataset, recently introduced by Webster et al. (2018). Each instance consists of: (1) a sequence of tokens $\{w_1, \ldots, w_T\}$ extracted from Wikipedia biographical pages; (2) two person names ($A$ and $B$, whose token index spans are denoted $s_A$ and $s_B$); (3) a single-token pronoun ($P$, with token index $s_P$); and (4) two binary labels ($y_A$ and $y_B$) indicating whether $P$ refers to $A$ or $B$. The task is then to identify the coreferential relationship between $P$ and $A$ or $B$ given $\{w_1, \ldots, w_T\}$.
Language modeling. Given the limited size of GAP, it is difficult for the model to learn a good representation of text. We therefore consider the task of language modeling as a pre-training step. We make use of the page text of the original Wikipedia articles from GAP, the URLs to which are included as part of the data release.³ This results in a corpus of 3.8 million tokens. We pretrain the referential reader on this data. The reader is free to use the memory to improve its language modeling performance, but it receives no supervision on the coreference links that might be imputed on this unlabeled data.
Evaluation. We measure the performance of our model with the GAP evaluation script. We report the overall $F_1$, as well as the scores by gender (masculine: $F_1^M$; feminine: $F_1^F$), and the bias (the ratio of $F_1^F$ to $F_1^M$).
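The bias metric is simply a ratio of gendered F1 scores; for concreteness (the counts below are made up for illustration, not GAP numbers):

```python
def f1(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for the masculine and feminine subsets.
f1_m = f1(tp=80, fp=20, fn=20)   # masculine F1
f1_f = f1(tp=70, fp=30, fn=30)   # feminine F1
bias = f1_f / f1_m               # below 1.0 means worse feminine F1
```
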
Systems. We benchmark against a collection of strong baselines presented in the work of Webster et al. (2018): (1) the mention-pair coreference resolution system of Lee et al. (2017); (2) a rule-based system based on syntactic parallelism (Webster et al., 2018); (3) a domain-specific variant of (2) that incorporates the lexical overlap between each candidate and the title of the original Wikipedia page (Webster et al., 2018). The configuration of our Referential Reader model is described in Appendix A.
Results. Experimental results on the GAP test set are shown in Table 1. The RefReader model, when properly initialized with a pretrained language model, achieves state-of-the-art performance, beating strong pre-trained systems (e.g., Lee et al., 2017) as well as domain-specific heuristics (Parallelism+URL). The model benefits substantially from a pre-trained language model, with an absolute gain of 3.2 in $F_1$ over the variant trained with the pronoun coreference objective only. This demonstrates the ability of RefReader to leverage unlabeled text, which is a distinctive feature in comparison with prior work. Interestingly, when training is carried out in the unsupervised setting (with the language modeling objective only), the model is still capable of learning the latent coreferential structure between pronouns and names to some extent. In all cases, gender bias is minimal, especially in comparison with most of the coreference systems trained on OntoNotes.
GAP examples are short, containing just a few entity mentions. To test the applicability of our method to longer instances, we produce an alternative test set in which pairs of GAP instances are concatenated together, doubling the average number of tokens and entity mentions. Even with a memory size of two, performance drops only slightly, to $F_1 = 71.4$ (from $F_1 = 72.1$ on the original test set). This demonstrates that the model is capable of reusing memory cells when the number of entities is larger than the size of the memory. An example is shown in Appendix B.

Related Work
Memory networks provide a general architecture for online updates to a set of distinct memories (Weston et al., 2015; Sukhbaatar et al., 2015). Henaff et al. (2017) used memories to track the states of multiple entities in a text, but they predefined the alignment of entities to memories, rather than learning to align entities with memories using gates. The incorporation of entities into language models has also been explored in prior work (Yang et al., 2017; Kobayashi et al., 2017); similarly, Dhingra et al. (2018) augment the gated recurrent unit (GRU) architecture with additional edges between coreferent mentions. In general, this line of prior work assumes that coreference information is available at test time (e.g., from a coreference resolution system), rather than determining coreference in an online fashion. Ji et al. (2017) propose a generative entity-aware language model that incorporates coreference as a discrete latent variable. For this reason, importance sampling is required for inference, and the model cannot be trained on unlabeled data.

Conclusion
This paper demonstrates the viability of anaphora resolution in an online framework, using an end-to-end differentiable memory network architecture. This enables semi-supervised learning from a language modeling objective, which substantially improves performance on the GAP dataset. A key question for future work is the performance on longer texts, such as the full-length news articles encountered in OntoNotes, which would presumably require a larger memory. Another interesting direction is to further explore semi-supervised learning, by reducing the amount of labeled training data.

B Example
Figure 3 gives an example of the behavior of the referential reader, as applied to a concatenation of two instances from GAP. The top panel shows the salience of each entity as each token is consumed, with the two memory cells distinguished by color. The figure elides long spans of tokens whose gate activations are nearly zero. These tokens are indicated on the x-axis by ellipses; the corresponding decrease in salience is larger, because it represents a longer span of text. The bottom panel shows the gate activations for each token, with memory cells again distinguished by color, and operations distinguished by line style. The gold token-entity assignments are indicated with color.
The reader essentially ignores the first name, Braylon Edwards, making a very weak overwrite to memory 0 (m0). It then makes a large overwrite to m0 on the pronoun his. When encountering the token Avant, the reader makes an update to the same memory cell, creating a cataphoric link between Avant and his. The name Padbury appears much later (as indicated by the ellipsis), and at this point, m0 has lower salience than m1. For this reason, the reader chooses to overwrite m0 with this name. The reader ignores the name Cathy Vespers and overwrites m1 with the adverb coincidentally. On encountering the final pronoun she, the reader is conflicted, and makes a partial overwrite to m0, a partial update (indicating coreference with Padbury), and a weaker update to m1. If the update to m0 is above the threshold, then the reader may receive credit for this coreference edge, which would otherwise be scored as a false negative.
The reader ignores the names Braylon Edwards, Piers Haggard, and Cathy Vespers, leaving them out of the memory. Edwards and Vespers appear in prepositional phrases, while Haggard is a possessive determiner of the object of a prepositional phrase. Centering theory argues that these syntactic positions have low salience in comparison with subject and object position (Grosz et al., 1995). It is possible that the reader has learned this principle, and that this is why it chooses not to store these names in memory. However, the reader also learns from the GAP supervision that pronouns are important, and therefore stores the pronoun his even though it is also a possessive determiner.
Figure 1: A referential reader with two memory cells. Overwrite and update are indicated by $o_t^{(i)}$ and $u_t^{(i)}$; in practice, these operations are continuous gates. Thickness and color intensity of edges between memory cells at neighboring steps indicate memory salience; indicates an overwrite.

Figure 3: An example of the referential reader, as applied to a concatenation of two instances from GAP. The ground truth is indicated by the color of each token on the x-axis.
We take the token index span $s_P$ of the pronoun and predict a positive coreferential relation between the pronoun $P$ and person name $A$ if any value (over the span $s_A$) of $\psi_{s_A, s_P}$ (if $s_A < s_P$) or $\psi_{s_P, s_A}$ (otherwise) is greater than a threshold value, selected on the validation set.⁴ Early stopping is applied based on the performance on the validation set. We use the following model hyper-parameters: embedding size $D_x = 300$; memory key and value sizes $D_k = 16$ and $D_v = 300$; number of memory cells $N = 2$; the size of both the pre-recurrent and hidden states is set to $D_h = 300$; the non-entity and entity half-lives are 30 and 4, respectively; the dimensions of the hidden layers in $f_k$/$\mathrm{GRU}_k$ and $f_v$/$\mathrm{GRU}_v$ are the same as $D_k = 16$ and $D_v = 300$. The Gumbel softmax starts at temperature $\tau = 1.0$, with an exponential decay rate of 0.5 applied every 10 epochs. Dropout is applied to the embedding layer, the pre-recurrent state $\tilde{h}_t$, and the GRU hidden state $h_t$, with a rate of 0.5. Embeddings are not updated during training (for +LM, initialized with GloVe (Pennington et al., 2014); for +LM+Coref, inherited from the pre-trained language model embeddings). Language modeling pre-training is carried out using the same set of hyper-parameters, with embedding updates and early stopping based on perplexity on the validation set.