Reference-Aware Language Models

We propose a general class of language models that treat reference as discrete stochastic latent variables. This decision allows for the creation of entity mentions by accessing external databases of referents (required by, e.g., dialogue generation) or past internal state (required to explicitly model coreferentiality). Beyond simple copying, our coreference model can additionally refer to a referent using varied mention forms (e.g., a reference to “Jane” can be realized as “she”), a characteristic feature of reference in natural languages. Experiments on three representative applications show our model variants outperform models based on deterministic attention and standard language modeling baselines.


Introduction
Referring expressions (REs) in natural language are noun phrases (proper nouns, common nouns, and pronouns) that identify objects, entities, and events in an environment. REs occur frequently and play a key role in communicating information efficiently. Although REs are common in natural language, most previous work does not model them explicitly, either treating REs as ordinary words in the model or replacing them with special tokens that are filled in by a post-processing step (Wen et al., 2015; Luong et al., 2015). Here we propose a language modeling framework that explicitly incorporates reference decisions. In part, this is based on the principle of pointer networks, in which copies are made from another source (Gülçehre et al., 2016; Gu et al., 2016; Ling et al., 2016; Merity et al., 2016). However, in the full version of our model, we go beyond simple copying and enable coreferent mentions to have different forms, a key characteristic of natural language reference.
Figure 1 depicts examples of REs in the context of the three tasks that we consider in this work. First, many models need to refer to a list of items (Kiddon et al., 2016; Wen et al., 2015). In the task of recipe generation from a list of ingredients (Kiddon et al., 2016), the generation of the recipe will frequently refer to these items. As shown in Figure 1, in the recipe "Blend soy milk and . . . ", soy milk refers to the ingredient summaries. Second, reference to a database is crucial in many applications. One example is task-oriented dialogue, where access to a database is necessary to answer a user's query (Young et al., 2013; Wen et al., 2015; Sordoni et al., 2015; Serban et al., 2016; Bordes and Weston, 2016; Williams and Zweig, 2016; Shang et al., 2015; Wen et al., 2016). Here we consider the domain of restaurant recommendation, where a system refers to restaurants (name) and their attributes (address, phone number, etc.) in its responses. When the system says "the nirala is a nice restaurant", it refers to the restaurant name the nirala from the database. Finally, we address references within a document (Mikolov et al., 2010; Wang and Cho, 2015), as the generation of words will often refer to previously generated words. For instance, the same entity will often be referred to throughout a document. In Figure 1, the entity you refers to I in a previous utterance. In this case, copying is insufficient: although the referent is the same, the form of the mention is different.
* Work completed at DeepMind.
In this work we develop a language model that has a specific module for generating REs. A series of decisions (should I generate a RE? If yes, which entity in the context should I refer to? How should the RE be rendered?) augment a traditional recurrent neural network language model and the two components are combined as a mixture model. Selecting an entity in context is similar to familiar models of attention (Bahdanau et al., 2014), but rather than being a soft decision that reweights representations of elements in the context, it is treated as a hard decision over contextual elements which are stochastically selected and then copied or, if the task warrants it, transformed (e.g., a pronoun rather than a proper name is produced as output). In cases when the stochastic decision is not available in training, we treat it as a latent variable and marginalize it out. For each of the three tasks, we pick one representative application and demonstrate our reference aware model's efficacy in evaluations against models that do not explicitly include a reference operation.
Our contributions are as follows:
• We propose a general framework to model reference in language. We consider reference to entries in lists, tables, and document context. We instantiate these tasks in three specific applications: recipe generation, dialogue modeling, and coreference based language models.
• We develop the first neural model of reference that goes beyond copying and can model (conditional on context) how to form the mention.
• We perform a comprehensive evaluation of our models on the three data sets and verify that our proposed models perform better than strong baselines.
Reference-aware language models
Here we propose a general framework for reference-aware language models. We denote each document as a series of tokens $x_1, \ldots, x_L$, where $L$ is the number of tokens. Our goal is to maximize $p(x_i \mid c_i)$, the probability of each word $x_i$ given its previous context $c_i = x_1, \ldots, x_{i-1}$. In contrast to traditional neural language models, we introduce a variable $z_i$ at each position, which controls the decision on which source $x_i$ is generated from. The conditional probability is then given by
$$p(x_i \mid c_i) = \sum_{z_i} p(x_i \mid z_i, c_i)\, p(z_i \mid c_i),$$
where $z_i$ has different meanings in different contexts. If $x_i$ is from a reference list or a database, then $z_i$ is one-dimensional and $z_i = 1$ denotes that $x_i$ is generated as a reference. $z_i$ can also model more complex decisions. In the coreference based language model, $z_i$ denotes a series of sequential decisions, such as whether $x_i$ is an entity and, if so, which entity $x_i$ refers to. When $z_i$ is not observed, we train our model to maximize the marginal probability over $z_i$, i.e., $p(x_i \mid c_i)$.

Reference to lists
We begin to instantiate the framework by considering reference to a list of items. Referring to a list of items has broad applications, such as generating documents from summaries. Here we specifically consider the application of recipe generation conditioned on an ingredient list. Table 1 illustrates the ingredient list and recipe for Spinach and Banana Power Smoothie. We can see that the ingredients soy milk, spinach leaves, and banana occur in the recipe.
Table 1: Ingredients and recipe for Spinach and Banana Power Smoothie.
Ingredients: 1 cup plain soy milk; 3/4 cup packed fresh spinach leaves; 1 large banana, sliced
Recipe: Blend soy milk and spinach leaves together in a blender until smooth. Add banana and pulse until thoroughly blended.
We would like to model $p(y \mid X) = \prod_v p(y_v \mid X, y_{<v})$, where $X$ denotes the ingredient list and $y_v$ the $v$-th token of the recipe $y$.
We first use an LSTM (Hochreiter and Schmidhuber, 1997) to encode the tokens of each ingredient. We then sum the final states of all ingredients to obtain the initial LSTM state of the decoder. We use an attention-based decoder with state $s_v$, in which ATTN(h, q) is the attention function that returns a probability distribution over the set of vectors h, conditioned on an input representation q. A full description of this operation is given in (Bahdanau et al., 2014). The decision to copy from the ingredient list or to generate a new word from the softmax is made by a switch, denoted $p(z_v \mid s_v)$. We obtain a probability distribution over copying each of the words in the ingredients, $p^{copy}_v$, from the attention over the ingredient tokens conditioned on $s_v$; the softmax over the vocabulary is denoted $p^{vocab}_v$.
Objective: We can obtain the value of $z_v$ through a string match of tokens in recipes with tokens in ingredients. If a token appears in the ingredients, we set $z_v = 1$, and $z_v = 0$ otherwise. We can then train the model in a fully supervised fashion, i.e., with $z_v$ observed, the probability of $y_v$ is
$$p(y_v, z_v \mid s_v) = \begin{cases} p^{copy}_v(y_v)\, p(z_v = 1 \mid s_v) & \text{if } z_v = 1, \\ p^{vocab}_v(y_v)\, p(z_v = 0 \mid s_v) & \text{if } z_v = 0. \end{cases}$$
However, these labels may not be accurate. In many cases, tokens that appear in the ingredients do not specifically refer to ingredient tokens. For example, the recipe may start with "Prepare a cup of water". The token "cup" does not refer to the "cup" in the ingredient "1 cup plain soy milk".
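The string-match labelling and its failure case can be illustrated with a small sketch. The function name `string_match_labels` and the whitespace tokenization are assumptions made for illustration; they are not taken from the paper.

```python
def string_match_labels(recipe_tokens, ingredients):
    """Assign z_v = 1 to every recipe token that occurs in any ingredient.

    Hypothetical helper: ingredients are plain strings split on whitespace.
    """
    ingredient_vocab = {tok for ingredient in ingredients for tok in ingredient.split()}
    return [1 if tok in ingredient_vocab else 0 for tok in recipe_tokens]


ingredients = [
    "1 cup plain soy milk",
    "3/4 cup packed fresh spinach leaves",
    "1 large banana , sliced",
]
recipe = "prepare a cup of water".split()
labels = string_match_labels(recipe, ingredients)

# "cup" is labelled as a copy (z_v = 1) although it does not refer to the
# ingredient list, which is the noise the supervised objective suffers from.
print(list(zip(recipe, labels)))
```

Here the labels come out as `[0, 0, 1, 0, 0]`: the only token marked as a reference is the spurious match on "cup".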
To solve this problem, we treat $z_v$ as a latent variable and maximize the marginal probability of $y_v$ over all possible values of $z_v$. In this way, the model can automatically learn when to refer to tokens in the ingredients. The probability of generating token $y_v$ is thus defined as
$$p(y_v \mid s_v) = p^{vocab}_v(y_v)\, p(z_v = 0 \mid s_v) + p^{copy}_v(y_v)\, p(z_v = 1 \mid s_v).$$
If no string match is found for $y_v$, we simply set $p^{copy}_v(y_v) = 0$ in the above objective.
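The difference between the supervised and the latent objective can be made concrete with a minimal sketch. It assumes the per-token quantities (the softmax probability, the copy probability, and the switch probability) have already been computed by the decoder; the function name `token_prob` and the numbers are illustrative only.

```python
def token_prob(p_vocab, p_copy, p_switch, z=None):
    """Probability of one token under the copy/generate mixture.

    p_vocab:  probability of the token under the vocabulary softmax
    p_copy:   probability of copying the token (0.0 if no string match)
    p_switch: p(z_v = 1 | s_v), the probability of choosing to copy
    z:        observed decision (0 or 1), or None to marginalize over z_v
    """
    if z is None:
        # Latent variant: sum over both values of z_v.
        return p_vocab * (1.0 - p_switch) + p_copy * p_switch
    # Supervised variant: z_v is observed, so only one branch contributes.
    return p_copy * p_switch if z == 1 else p_vocab * (1.0 - p_switch)


latent = token_prob(p_vocab=0.02, p_copy=0.5, p_switch=0.3)
copied = token_prob(p_vocab=0.02, p_copy=0.5, p_switch=0.3, z=1)
generated = token_prob(p_vocab=0.02, p_copy=0.5, p_switch=0.3, z=0)
print(latent, copied, generated)
```

The latent probability is the sum of the two supervised branches, so a token such as "cup" can be explained by the vocabulary softmax even when a string match exists.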

Reference to databases
We then consider the more complicated task of reference to database entries. Referring to databases is common in question answering and dialogue systems, in which the database serves as external knowledge that is consulted to answer users' queries. In this paper, we consider the application of task-oriented dialogue systems in the domain of restaurant recommendation. Unlike lists, which are one-dimensional, databases are two-dimensional, and referring to table entries requires a more sophisticated model design.
To better understand the model, we first briefly introduce the data set. We use dialogues from the second Dialogue State Tracking Challenge (DSTC2) (Henderson et al., 2014). Table 3 shows one example dialogue from this data set.
We can observe from this example that users get restaurant recommendations based on queries that specify the area, price, and food type of the restaurant. We can support the system's decisions by incorporating a mechanism that allows the model to query the database to find restaurants that satisfy the users' queries. A sample of our database (see the data preparation section for how we construct it) is shown in Table 2. Each restaurant has 6 attributes that are commonly referred to in the dialogue data set. As such, if the user requests a restaurant that serves "indian" food, we wish to train a model that can search for entries whose "food" column contains "indian". We now describe a model that fulfills these requirements. We first introduce the basic dialogue framework into which we incorporate the table reference module.
Figure 3: Hierarchical RNN Seq2Seq model. The red box denotes the attention mechanism over the utterances in the previous turn.
Basic Dialogue Framework: We build a basic dialogue model based on the hierarchical RNN model described in (Serban et al., 2016), since in dialogues the generation of the response depends not only on the previous sentence but on all sentences leading up to the response. We assume that a dialogue alternates between a machine and a user. An illustration of the model is shown in Figure 3. Consider a dialogue with $T$ turns. The utterances from the user and the machine are denoted $X = \{x_i\}_{i=1}^T$ and $Y = \{y_i\}_{i=1}^T$, where $x_{ij}$ ($y_{iv}$) denotes the $j$-th ($v$-th) token in the $i$-th utterance from the user (the machine). The dialogue sequence starts with a machine utterance and is given by $\{y_1, x_1, y_2, x_2, \ldots, y_T, x_T\}$. We would like to model the utterances from the machine,
$$p(y_1, \ldots, y_T \mid x_1, \ldots, x_T) = \prod_i p(y_i \mid y_{<i}, x_{<i}) = \prod_{i,v} p(y_{i,v} \mid y_{i,<v}, y_{<i}, x_{<i}).$$
We encode $y_{<i}$ and $x_{<i}$ into a continuous space in a hierarchical way with LSTMs.
Sentence Encoder: For a given utterance $x_i$, we encode it as $h^x_{i,j} = \mathrm{LSTM}(W_E x_{i,j}, h^x_{i,j-1})$ and take the last hidden state, $h^x_i = h^x_{i,|x_i|}$, as its representation. The same process is applied to obtain the machine utterance representation $h^y_i = h^y_{i,|y_i|}$.
Turn Encoder: We further encode the sequence $\{h^y_1, h^x_1, \ldots, h^y_i, h^x_i\}$ with another LSTM encoder. We refer to the last hidden state as $u_i$, which can be seen as the hierarchical encoding of the previous $i$ utterances.
Decoder: We use $u_{i-1}$ as the initial state of the decoder LSTM and decode each token in $y_i$. The decoder state is $s^y_{i,v} = \mathrm{LSTM}(W_E y_{i,v-1}, s^y_{i,v-1})$, and each token is predicted with a softmax over the vocabulary computed from $s^y_{i,v}$. We can also incorporate the attention mechanism in the decoder. As shown in Figure 3, we use the attention mechanism over the utterance in the previous turn. Due to space limits, we do not present the attention-based decoder mathematically; readers can refer to (Bahdanau et al., 2014) for details.

Incorporating Table Reference
We now extend the decoder in order to allow the model to condition the generation on a database.
Pointer Switch: We use $z_{i,v} \in \{0, 1\}$ to denote the decision of whether to copy one cell from the table. We compute the probability of this decision, $p(z_{i,v} \mid s_{i,v})$, from the current decoder state. Thus, if $z_{i,v} = 1$, the next token $y_{i,v}$ is generated from the database, whereas if $z_{i,v} = 0$, the next token is generated from a softmax. We now describe how we generate tokens from the database.
Figure 4: Decoder with table pointer. Steps 1, 3, and 5 of the table pointer are the attribute, row, and column attentions.
We denote a table with $R$ rows and $C$ columns as $\{t_{r,c}\}$, $r \in [1, R]$, $c \in [1, C]$, where $t_{r,c}$ is the cell in row $r$ and column $c$. The attribute of each column is denoted $s_c$, where $c$ indexes the $c$-th attribute. $t_{r,c}$ and $s_c$ are one-hot vectors.
Table Encoding: To encode the table, we first build an attribute vector and then an encoding vector for each cell. The attribute vector is simply an embedding lookup $g_c = W_E s_c$. For the encoding of each cell, we first concatenate the embedding lookup of the cell with the corresponding attribute vector $g_c$ and then feed it through a one-layer MLP: $e_{r,c} = \tanh(W [W_E t_{r,c}, g_c])$.
Table Pointer: The table pointer is illustrated in Figure 4. The attention over cells in the table is conditioned on a given vector $q$, similarly to the attention model for sequences. However, rather than a sequence of vectors, we now operate over a table.
Step 1: Attention over the attributes to find out which attributes a user asks about, $p^a = \mathrm{ATTN}(\{g_c\}, q)$. Suppose a user says cheap; then we should focus on the price attribute.
Step 2: Conditional row representation calculation, $e_r = \sum_c p^a_c e_{r,c}\ \forall r$, so that $e_r$ contains the price information of the restaurant in row $r$.
Step 3: Attention over $\{e_r\}$ to find the restaurants that satisfy the user's query, $p^r = \mathrm{ATTN}(\{e_r\}, q)$. Restaurants with a cheap price will be picked.
Step 4: Using the probabilities $p^r$, we compute the weighted average over all rows, $e_c = \sum_r p^r_r e_{r,c}$. $\{e_c\}$ contains the information of the cheap restaurants.
Step 5: Attention over the columns $\{e_c\}$ to compute the probabilities of copying each column, $p^c = \mathrm{ATTN}(\{e_c\}, q)$.
Step 6: To get the probability matrix of copying each cell, we simply compute the outer product $p^{copy} = p^r \otimes p^c$.
Together, Steps 1 to 6 map the table and a conditioning vector $q$ to the copy distribution $p^{copy}$ over cells. If $z_{i,v} = 1$, we embed this attention process in the decoder by replacing the conditioning vector $q$ with the current decoder state $s^y_{i,v}$.
Objective: As in the previous task, we can train the model in a fully supervised fashion, or we can treat the decision as a latent variable. We obtain $p(y_{i,v} \mid s_{i,v})$ in a similar way.
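The six steps above can be sketched in plain Python. This is a minimal illustration rather than the implementation used in the experiments: `attn` is a stand-in that uses a softmax over dot products instead of the attention function of Bahdanau et al. (2014), and the function names and toy vectors are assumptions.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attn(vectors, q):
    """Stand-in for ATTN: softmax over dot products between each vector and q."""
    return softmax([sum(a * b for a, b in zip(v, q)) for v in vectors])


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]


def table_pointer(g, e, q):
    """g: C attribute vectors, e[r][c]: cell vectors, q: conditioning vector."""
    R, C = len(e), len(g)
    p_a = attn(g, q)                                          # Step 1: attributes
    e_row = [weighted_sum(p_a, e[r]) for r in range(R)]       # Step 2: row vectors
    p_r = attn(e_row, q)                                      # Step 3: rows
    e_col = [weighted_sum(p_r, [e[r][c] for r in range(R)])   # Step 4: column vectors
             for c in range(C)]
    p_c = attn(e_col, q)                                      # Step 5: columns
    p_copy = [[p_r[r] * p_c[c] for c in range(C)]             # Step 6: outer product
              for r in range(R)]
    return p_a, p_r, p_c, p_copy


# Toy table with R = 2 rows, C = 3 columns, and 2-dimensional encodings.
g = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
e = [
    [[0.9, 0.1], [0.2, 0.8], [0.4, 0.4]],
    [[0.1, 0.9], [0.7, 0.3], [0.5, 0.2]],
]
q = [1.0, -1.0]
p_a, p_r, p_c, p_copy = table_pointer(g, e, q)
print(p_copy)
```

Because $p^r$ and $p^c$ are each normalized, their outer product is a valid distribution over all cells, which is what allows it to be mixed with the vocabulary softmax through the pointer switch.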

Reference to document context
Finally, we address references that occur within a document itself and build a language model that uses coreference links to point to previous entities. Before generating a word, we first decide whether it is an entity mention. If so, we decide which entity this mention belongs to, and then we generate the word based on that entity. Denote the document as $X = \{x_i\}_{i=1}^L$; each entity has a set of mentions in the document that refer to the same entity. We use an LSTM to model the document; the hidden state of each token is $h_i = \mathrm{LSTM}(W_E x_i, h_{i-1})$. We use a set $h^e = \{h^e_0, h^e_1, \ldots, h^e_M\}$ to keep track of the entity states, where $h^e_j$ is the state of entity $j$.
Word generation: At each time step, before generating the next word, we predict whether the word is an entity mention and, if so, which entity it corefers to. Here $z_i$ denotes whether the next word is an entity mention and, if it is, $v_i$ denotes which entity the next word corefers to. If the next word is an entity mention, it is generated from $p(x_i \mid v_i, h_{i-1}, h^e)$, which conditions on the entity states.
Figure 5: Coreference based language model, example taken from Wiseman et al. (2016).
Entity state update: Since each entity has multiple mentions and the mentions appear dynamically, we need to keep track of the entity state in order to use coreference information in entity mention prediction. We update the entity state $h^e$ at each time step. In the beginning, $h^e = \{h^e_0\}$, where $h^e_0$ denotes the state of a virtual empty entity and is a learnable variable. If $z_i = 1$ and $v_i = 0$, the next word is a mention of a new entity, and in the next step we append $h_i$ to $h^e$, i.e., $h^e = \{h^e, h_i\}$. If $z_i = 1$ and $v_i > 0$, we update the corresponding entity state with the new hidden state, $h^e[v_i] = h_i$. Another way to update the entity state is to use an LSTM to encode the mention states and obtain the new entity state. Here we use the latest entity mention state as the new entity state for simplicity. The detailed update process is shown in Figure 5.
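The update rule above can be sketched as follows. Hidden states are represented as plain lists of floats, and the function name `update_entity_states` is an assumption made for illustration.

```python
def update_entity_states(entity_states, h_i, z_i, v_i):
    """Return the entity states after observing decisions (z_i, v_i).

    entity_states[0] is the virtual empty entity h^e_0.
    z_i = 1, v_i = 0: mention of a new entity, append h_i.
    z_i = 1, v_i > 0: mention of entity v_i, overwrite its state with h_i.
    z_i = 0:          not an entity mention, states are unchanged.
    """
    states = list(entity_states)
    if z_i == 1 and v_i == 0:
        states.append(h_i)
    elif z_i == 1 and v_i > 0:
        states[v_i] = h_i
    return states


h_e = [[0.0, 0.0]]                                          # only the empty entity
h_e = update_entity_states(h_e, [0.1, 0.2], z_i=1, v_i=0)   # new entity 1
h_e = update_entity_states(h_e, [0.3, 0.4], z_i=0, v_i=0)   # ordinary word
h_e = update_entity_states(h_e, [0.5, 0.6], z_i=1, v_i=1)   # entity 1 mentioned again
print(h_e)
```

After these three steps the set holds the empty entity and one real entity whose state is that of its latest mention, `[0.5, 0.6]`.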
Note that the stochastic decisions in this task are more complicated than in the previous two tasks. We need to make two sequential decisions: whether the next word is an entity mention and, if so, which entity the mention corefers to. It is intractable to marginalize over these decisions, so we train this model in a supervised fashion (see the data preparation section for how we obtain coreference annotations).

Data sets and preprocessing
Recipes: We crawled all recipes from www.allrecipes.com. There are about 31,000 recipes in total, and every recipe has an ingredient list and a corresponding recipe. We exclude recipes that have fewer than 10 tokens or more than 500 tokens; these recipes make up about 0.1% of the data set. On average each recipe has 118 tokens and 9 ingredients. We randomly shuffle the whole data set and take 80% for training and 10% each for validation and test. We use a vocabulary size of 10,000 in the model.
Dialogue: We use the DSTC2 data set. We only use the dialogue transcripts from the data set. There are about 3,200 dialogues in total. The table is not available from DSTC2. To reconstruct the table, we crawled TripAdvisor for restaurants in the Cambridge area, where the dialogue data set was collected. We then remove restaurants that do not appear in the data set and create a database with 109 restaurants and their attributes (e.g., food type). Since this is a small data set, we use 5-fold cross validation and report the average result over the 5 partitions. A table cell may contain multiple tokens; for example, in Table 2 the name, address, post code, and phone number have multiple tokens. We replace each such cell with one special token: for the name, address, post code, and phone number of the j-th row, we replace the tokens in each cell with NAME_j, ADDR_j, POSTCODE_j, and PHONE_j. If a table cell is empty, we replace it with an empty token EMPTY. We do a string match in the transcripts and replace the corresponding table tokens with the special tokens. Each dialogue has on average 8 turns (16 sentences). We use a vocabulary size of 900, including about 400 table tokens and 500 words.
Coref LM: We use the Xinhua News data set from Gigaword Fifth Edition and sample 100,000 documents whose length is between 100 and 500 tokens. Each document has on average 234 tokens, so there are 23 million tokens in total. We process the documents to get coreference annotations and use the annotations, i.e., $z_i$ and $v_i$, in training. We take 80% for training and 10% each for validation and test. We ignore entities that have only one mention, and for mentions that have multiple tokens, we take the token that is most frequent across all the mentions of that entity. After preprocessing, tokens that are entity mentions make up about 10% of all tokens. We use a vocabulary size of 50,000 in the model.
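The reduction of multi-token mentions to a single token can be sketched as below. This is one possible reading of the preprocessing step, in which each mention is replaced by its own token that is most frequent across all mentions of the entity; the function name `head_tokens`, the tie-breaking (first token wins), and the example mentions are assumptions.

```python
from collections import Counter


def head_tokens(mentions):
    """Reduce each mention of one entity to a single representative token.

    mentions: list of token lists, all referring to the same entity.
    For every mention, keep the token with the highest count over all
    mentions of the entity (ties resolved in favour of the earlier token).
    """
    counts = Counter(tok for mention in mentions for tok in mention)
    return [max(mention, key=lambda tok: counts[tok]) for mention in mentions]


mentions = [["barack", "obama"], ["obama"], ["president", "obama"], ["he"]]
reduced = head_tokens(mentions)
print(reduced)
```

For this entity the result is `['obama', 'obama', 'obama', 'he']`: multi-token mentions collapse to the shared token, while the pronoun mention keeps its own form.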

Baselines, model training and evaluation
We compare our model with baselines that do not model reference explicitly. For recipe generation and dialogue modeling, we compare our model with basic seq2seq and attention models. We also apply an attention mechanism over the table for dialogue modeling as a baseline. For the coreference based language model, we compare our model with a simple RNN language model.
We train all models with simple stochastic gradient descent with gradient clipping. We use a one-layer LSTM for all RNN components. Hyperparameters are selected using grid search based on the validation set. Evaluation of our model is challenging since it involves three rather different applications. We focus on evaluating the accuracy of predicting the reference tokens, which is the goal of our model. Specifically, we report the perplexity of all words, of words that can be generated from reference, and of non-reference words. The perplexity is calculated by multiplying together the probabilities of the decisions at each step. Note that non-reference words also appear in the vocabulary, so this is a fair comparison to models that do not model reference explicitly. For the recipe task, we also generate recipes using a beam size of 10 and evaluate the generated recipes with BLEU. We did not use BLEU for dialogue generation since the database entries make up only a very small part of all tokens in the utterances.
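The three perplexities can be computed from per-token probabilities as in the following sketch. It assumes each token already comes with the probability the model assigned to it (including any reference decision) and a flag marking reference tokens; the function names and toy numbers are illustrative.

```python
import math


def perplexity(probs):
    """Exponentiated average negative log-probability of a token sequence."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


def grouped_perplexity(probs, is_reference):
    """Perplexity over all tokens, reference tokens, and non-reference tokens."""
    ref = [p for p, r in zip(probs, is_reference) if r]
    non_ref = [p for p, r in zip(probs, is_reference) if not r]
    return {
        "all": perplexity(probs),
        "reference": perplexity(ref),
        "non_reference": perplexity(non_ref),
    }


probs = [0.5, 0.25, 0.5, 0.25]
is_reference = [True, False, True, False]
result = grouped_perplexity(probs, is_reference)
print(result)
```

Splitting the tokens this way makes it visible when a gain on rare reference tokens is hidden in the overall perplexity, as reported for the dialogue task.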

Results and analysis
The results for recipe generation, dialogue, and the coref based language model are shown in Tables 4, 5, and 6 respectively. The recipe results in Table 4 verify that modeling reference explicitly improves performance. Latent and Pointer perform better than the Seq2Seq and Attn models. The Latent model performs better than the Pointer model since tokens in ingredients that match tokens in recipes do not necessarily come from the ingredients. Imposing a supervised signal gives wrong information to the model and hence makes the result worse. With a latent decision, the model learns when to copy and when to generate from the vocabulary.
The findings for dialogue basically follow those for recipe generation, as shown in Table 5. Conditioning on the table generally performs better in predicting table tokens. Table Pointer has the lowest perplexity for tokens in the table. Since table tokens appear rarely in the dialogue transcripts, the overall perplexity does not differ much, and the non-table token perplexities are similar. With an attention mechanism over the table, the perplexity of table tokens improves over the basic Seq2Seq model, but it is still not as good as directly pointing to cells in the table, which shows the advantage of modeling reference explicitly. As expected, using sentence attention improves significantly over models without sentence attention. Surprisingly, Table Latent performs much worse than Table Pointer. We also measure the perplexity of table tokens that appear only in the test set. For models other than Table Pointer, because these tokens never appear in the training set, the perplexity is quite high, while Table Pointer can predict these tokens much more accurately. This verifies our conjecture that our model can learn reasoning over databases.
The coref based LM results are shown in Table 6. We find that the coref based LM performs much better on entity perplexity, but is slightly worse for non-entity words. We found this to be an optimization problem, with the model stuck in a local optimum. When we initialize the Pointer model with the weights learned from the LM, the Pointer model performs better than the LM on both entity perplexity and non-entity word perplexity.
In Appendix A, we also visualize the heat map of table reference and list items reference. The visualization shows that our model can correctly predict when to refer to which entries according to context.

Related Work
In terms of methodology, our work is closely related to previous work that incorporates copying mechanisms into neural models (Gülçehre et al., 2016; Gu et al., 2016; Ling et al., 2016). Our models are similar to the models proposed in (Merity et al., 2016), where the generation of each word can be conditioned on a particular entry in knowledge lists and previous words. In our work, we describe a model with broader applications, allowing us to condition on databases, lists, and dynamic lists. In terms of applications, our work is related to chit-chat dialogue (Sordoni et al., 2015; Serban et al., 2016; Shang et al., 2015) and task-oriented dialogue (Wen et al., 2015; Bordes and Weston, 2016; Williams and Zweig, 2016; Wen et al., 2016). Most previous work on task-oriented dialogue embeds the seq2seq model in a traditional dialogue system, in which the table query part is not differentiable, while our model queries the database directly. Recipe generation was proposed in (Kiddon et al., 2016). They use an attention mechanism over the checklists, whereas our work models explicit references to them. Context dependent language models (Mikolov et al., 2010; Jozefowicz et al., 2016; Wang and Cho, 2015) have been proposed to capture long-term dependencies in text. There is also a large body of work on coreference resolution (Haghighi and Klein, 2010; Wiseman et al., 2016). To the best of our knowledge, we are the first to combine coreference with language modeling.

Conclusion
We introduce reference-aware language models, which explicitly model the decision of where to generate each token from at each step. Our model can also learn this decision by treating it as a latent variable. We demonstrate on three applications (table based dialogue modeling, recipe generation, and coref based LM) that our model performs better than attention based models, which do not incorporate this decision explicitly. There are several directions to explore further based on our framework. The current evaluation method is based on perplexity and BLEU. In task-oriented dialogue, we could also use human evaluation to see whether the model answers users' queries accurately. It would also be interesting to use reinforcement learning to learn the actions at each step in the coref based LM.