Neural Text Generation from Structured Data with Application to the Biography Domain

This paper introduces a neural model for concept-to-text generation that scales to large, rich domains. It generates biographical sentences from fact tables on a new dataset of biographies from Wikipedia. This set is an order of magnitude larger than existing resources with over 700k samples and a 400k vocabulary. Our model builds on conditional neural language models for text generation. To deal with the large vocabulary, we extend these models to mix a ﬁxed vocabulary with copy actions that transfer sample-speciﬁc words from the input database to the generated output sentence. To deal with structured data, we allow the model to embed words differently depending on the data ﬁelds in which they occur. Our neural model significantly outperforms a Templated Kneser-Ney language model by nearly 15 BLEU.


Introduction
Concept-to-text generation renders structured records into natural language (Reiter et al., 2000). A typical application is to generate a weather forecast based on a set of structured meteorological measurements. In contrast to previous work, we scale to the large and very diverse problem of generating biographies based on Wikipedia infoboxes. An infobox is a fact table describing a person, similar to a person subgraph in a knowledge base (Bollacker et al., 2008;Ferrucci, 2012). Similar generation applications include the generation of product descriptions based on a catalog of millions of items with dozens of attributes each.
Previous work experimented with datasets that contain only a few tens of thousands of records such as WEATHERGOV or the ROBOCUP dataset, while our dataset contains over 700k biographies from * Rémi performed this work while interning at Facebook.
Wikipedia. Furthermore, these datasets have a limited vocabulary of only about 350 words each, compared to over 400k words in our dataset.
To tackle this problem we introduce a statistical generation model conditioned on a Wikipedia infobox. We focus on the generation of the first sentence of a biography which requires the model to select among a large number of possible fields to generate an adequate output. Such diversity makes it difficult for classical count-based models to estimate probabilities of rare events due to data sparsity. We address this issue by parameterizing words and fields as embeddings, along with a neural language model operating on them (Bengio et al., 2003). This factorization allows us to scale to a larger number of words and fields than Liang et al. (2009), or Kim and Mooney (2010) where the number of parameters grows as the product of the number of words and fields.
Moreover, our approach does not restrict the relations between the field contents and the generated text. This contrasts with less flexible strategies that assume the generation to follow either a hybrid alignment tree (Kim and Mooney, 2010), a probabilistic context-free grammar (Konstas and Lapata, 2013), or a tree adjoining grammar (Gyawali and Gardent, 2014).
Our model exploits structured data both globally and locally. Global conditioning summarizes all information about a personality to understand highlevel themes such as that the biography is about a scientist or an artist, while as local conditioning describes the previously generated tokens in terms of the their relationship to the infobox. We analyze the effectiveness of each and demonstrate their complementarity.

Related Work
Traditionally, generation systems relied on rules and hand-crafted specifications (Dale et al., 2003;Reiter et al., 2005;Green, 2006;Galanis and Androut-sopoulos, 2007;Turner et al., 2010). Generation is divided into modular, yet highly interdependent, decisions: (1) content planning defines which parts of the input fields or meaning representations should be selected; (2) sentence planning determines which selected fields are to be dealt with in each output sentence; and (3) surface realization generates those sentences.
Data-driven approaches have been proposed to automatically learn the individual modules. One approach first aligns records and sentences and then learns a content selection model (Duboue and McKeown, 2002;Barzilay and Lapata, 2005). Hierarchical hidden semi-Markov generative models have also been used to first determine which facts to discuss and then to generate words from the predicates and arguments of the chosen facts (Liang et al., 2009). Sentence planning has been formulated as a supervised set partitioning problem over facts where each partition corresponds to a sentence (Barzilay and Lapata, 2006). End-to-end approaches have combined sentence planning and surface realization by using explicitly aligned sentence/meaning pairs as training data (Ratnaparkhi, 2002;Wong and Mooney, 2007;Belz, 2008;Lu and Ng, 2011). More recently, content selection and surface realization have been combined (Angeli et al., 2010;Kim and Mooney, 2010;Konstas and Lapata, 2013).
At the intersection of rule-based and statistical methods, hybrid systems aim at leveraging human contributed rules and corpus statistics (Langkilde and Knight, 1998;Soricut and Marcu, 2006;Mairesse and Walker, 2011).
Our model is most similar to Mei et al. (2016) who use an encoder-decoder style neural network model to tackle the WEATHERGOV and ROBOCUP tasks. Their architecture relies on LSTM units and an attention mechanism which reduces scalability compared to our simpler design.  1914-21 November 1987 was an English linguist, plant pathologist, computer scientist, mathematician, mystic, and mycologist.".

Language Modeling for Constrained Sentence generation
Conditional language models are a popular choice to generate sentences. We introduce a tableconditioned language model for constraining text generation to include elements from fact tables.

Language model
Given a sentence s = w 1 , . . . , w T with T words from vocabulary W, a language model estimates: Let c t = w t−(n−1) , . . . , w t−1 be the sequence of n − 1 context words preceding w t . An n-gram language model makes an order n Markov assumption, (2)

Language model conditioned on tables
A table is a set of field/value pairs, where values are sequences of words. We therefore propose language models that are conditioned on these pairs. Local conditioning refers to the information from the table that is applied to the description of the words which have already generated, i.e. the previous words that constitute the context of the language model. The table allows us to describe each word not only by its string (or index in the vocabulary) but also by a descriptor of its occurrence in the table. Let F define the set of all possible fields f . The occurrence of a word w in the table is described by a set of (field, position) pairs.
where m is the number of occurrences of w. Each pair (f, p) indicates that w occurs in field f at position p. In this scheme, most words are described by the empty set as they do not occur in the table. For example, the word linguistics in the table of Figure 1 is described as follows: assuming words are lower-cased and commas are treated as separate tokens. Conditioning both on the field type and the position within the field allows the model to encode field-specific regularities, e.g., a number token in a date field is likely followed by a month token; knowing that the number is the first token in the date field makes this even more likely.
The (field, position) description scheme of the table does not allow to express that a token terminates a field which can be useful to capture field transitions. For biographies, the last token of the name field is often followed by an introduction of the birth date like '(' or 'was born'. We hence extend our descriptor to a triplet that includes the position of the token counted from the end of the field: where our example becomes: We extend Equation 2 to use the above information as additional conditioning context when generating a sentence s: where z ct = z w t−(n−1) , . . . , z w t−1 are referred to as the local conditioning variables since they describe the local context (previous word) relations with the table.
Global conditioning refers to information from all tokens and fields of the table, regardless whether they appear in the previous generated words or not. The set of fields available in a table often impacts the structure of the generation. For biographies, the fields used to describe a politician are different from the ones for an actor or an athlete. We introduce global conditioning on the available fields g f as Similarly, global conditioning g w on the available words occurring in the table is introduced: Tokens provide information complementary to fields. For example, it may be hard to distinguish a basketball player from a hockey player by looking only at the field names, e.g. teams, league, position, weight and height, etc. However the actual field tokens such as team names, league name, player's position can help the model to give a better prediction.
Here, g f ∈ {0, 1} F and g w ∈ {0, 1} W are binary indicators over fixed field and word vocabularies. Figure 2 illustrates the model with a schematic example. For predicting the next word w t after a given context c t , the language model is conditioned on sets of triplets for each word occurring in the table z ct , along with all fields and words from this table.

Copy actions
So far we extended the model conditioning with features derived from the fact table. We now turn to using table information when scoring output words. In particular, sentences which express facts from a given table often copy words from the table. We therefore extend our model to also score special field tokens such as name 1 or name 2 which are subsequently added to the score of the corresponding words from the field value.
Our model reads a table and defines an output domain W ∪Q. Q defines all tokens in the table, which might include out of vocabulary words ( / ∈ W). For instance Park-Rhodes in Figure 1 is not in W. However, Park-Rhodes will be included in Q as name 2 (since it is the second token of the name field) which allows our model to generate it. This mechanism is inspired by recent work on attention based word copying for neural machine translation (Luong et al., 2015) as well as delexicalization for neural dialog systems (Wen et al., 2015). It also builds upon older work such as class-based language models for dialog systems (Oh and Rudnicky, 2000).

A Neural Language Model Approach
A feed-forward neural language model (NLM) estimates P (w t |c t ) with a parametric function φ θ (Equation 1), where θ refers to all learnable parameters of the network. This function is a composition of simple differentiable functions or layers.

Mathematical notations and layers
We denote matrices as bold upper case letters (X, Y, Z), and vectors as bold lower-case letters (a, b, c). A i represents the i th row of matrix A. When A is a 3-d matrix, then A i,j represents the vector of the i th first dimension and j th second dimension. Unless otherwise stated, vectors are assumed to be column vectors. We use [v 1 ; v 2 ] to denote vector concatenation. Next, we introduce the notation for the different layers used in our approach.
Embedding layer. Given a parameter matrix X ∈ R N ×d , the embedding layer is a lookup table that performs an array indexing operation: where X i corresponds to the embedding of the element x i at row i. When X is a 3-d matrix, the lookup table takes two arguments: where X i,j corresponds to the embedding of the pair (x i , x j ) at index (i, j). The lookup table operation can be applied for a sequence of elements s = x 1 , . . . , x T . A common approach is to concatenate all resulting embeddings: Linear layer. This layer applies a linear transformation to its inputs x ∈ R n : where θ = {W, b} are the trainable parameters with W ∈ R m×n being the weight matrix, and b ∈ R m is the bias term.
Softmax layer. Given a context input c t , the final layer outputs a score for each word w t ∈ W, φ θ (c t ) ∈ R |W| . The probability distribution is obtained by applying the softmax activation function:

Embeddings as inputs
A key aspect of neural language models is the use of word embeddings. Similar words tend to have similar embeddings and thus share latent features. The probability estimates of those models are smooth functions of these embeddings, and a small change in the features results in a small change in the probability estimates (Bengio et al., 2003). Therefore, neural language models can achieve better generalization for unseen n-grams. Next, we show how we map fact tables to continuous space in similar spirit.
Word embeddings. Formally, the embedding layer maps each context word index to a continuous d-dimensional vector. It relies on a parameter matrix E ∈ R |W|×d to convert the input c t into n − 1 vectors of dimension d: E can be initialized randomly or with pre-trained word embeddings.
Table embeddings. As described in Section 3.2, the language model is conditioned on elements from the table. Embedding matrices are therefore defined to model both local and global conditioning information. For local conditioning, we denote the maximum length of a sequence of words as l. Each field f j ∈ F is associated with 2 × l vectors of d dimensions, the first l of those vectors embed all possible starting positions 1, . . . , l, and the remaining l vectors embed ending positions. This results in two parameter matrices Z = {Z + , Z − } ∈ R |F |×l×d . For a given triplet (f j , p + i , p − i ), ψ Z + (f j , p + i ) and ψ Z − (f j , p − i ) refer to the embedding vectors of the start and end position for field f j , respectively.
Finally, global conditioning uses two parameter matrices G f ∈ R |F |×g and G w ∈ R |W|×g . ψ G f (f j ) maps a table field f j into a vector of dimension g, while ψ G w (w t ) maps a word w t into a vector of the same dimension. In general, G w shares its parameters with E, provided d = g.
Aggregating embeddings. We represent each occurence of a word w as a triplet (field, start, end) where we have embeddings for the start and end position as described above. Often times a particular word w occurs multiple times in a table, e.g., 'lin-guistics' has two instances in Figure 1. In this case, we perform a component-wise max over the start embeddings of all instances of w to obtain the best features across all occurrences of w. We do the same for end position embeddings: A special no-field embedding is assigned to w t when the word is not associated to any fields. An embedding ψ Z (z ct ) for encoding the local conditioning of the input c t is obtained by concatenation.
For global conditioning, we define F q ⊂ F as the set of all the fields in a given table q, and Q as the set of all words in q. We also perform max aggregation. This yields the vectors The final embedding which encodes the context input with conditioning is then the concatenation of these vectors: For simplification purpose, we define the context input x = {c t , z ct , g f , g w } in the following equations. This context embedding is mapped to a latent context representation using a linear operation followed by a hyperbolic tangent: where α 2 = {W 2 , b 2 }, with W 2 ∈ R nhu×d 1 and b 2 ∈ R nhu .

In-vocabulary outputs
The hidden representation of the context then goes to another linear layer to produce a real value score for each word in the vocabulary: where α 3 = {W 3 , b 3 }, with W 3 ∈ R |W|×nhu and b 3 ∈ R |W| , and α = {α 1 , α 2 , α 3 }.

Mixing outputs for better copying
Section 3.3 explains that each word w from the table is also associated with z w , the set of fields in which it occurs, along with the position in that field. Similar to local conditioning, we represent each field and position pair (f j , p i ) with an embedding ψ F (f j , p i ), where F ∈ R |F|×l×d . These embeddings are then projected into the same space as the latent representation of context input h(x) ∈ R nhu . Using the max operation over the embedding dimension, each word is finally embedded into a unique vector: where β = {W 4 , b 4 } with W 4 ∈ R nhu×d , and b 4 ∈ R nhu . A dot product with the context vector produces a score for each word w in the table, Each word w ∈ W ∪ Q receives a final score by summing the vocabulary score and the field score: with θ = {α, β}, and where φ Q β (x, w) = 0 when w / ∈ Q. The softmax function then maps the scores to a distribution over W ∪ Q,

Training
The neural language model is trained to minimize the negative log-likelihood of a training sentence s with stochastic gradient descent (SGD; LeCun et al. 2012) :

Experiments
Our neural network model (Section 4) is designed to generate sentences from tables for large-scale problems, where a diverse set of sentence types need to be generated. Biographies are therefore a good framework to evaluate our model, with Wikipedia offering a large and diverse dataset.

Biography dataset
We introduce a new dataset for text generation, WIKIBIO, a corpus of 728,321 articles from English Wikipedia (Sep 2015). It comprises all biography articles listed by WikiProject Biography 1 which also have a table (infobox). We extract and tokenize the first sentence of each article with Stanford CoreNLP (Manning et al., 2014). All numbers are mapped to a special token, except for years which are mapped to different special token. Field values from tables are similarly tokenized. All tokens are lower-cased. Table 2 summarizes the dataset statistics: on average, the first sentence is twice as short as the table (26.1 vs 53.1 tokens), about a third of the sentence tokens (9.5) also occur in the table. The final corpus has been divided into three sub-parts to provide training (80%), validation (10%) and test sets (10%). The dataset is available for download 2 .

Baseline
Our baseline is an interpolated Kneser-Ney (KN) language model and we use the KenLM toolkit to train 5-gram models without pruning (Heafield et al., 2013). We also learn a KN language model over templates. For that purpose, we replace the words occurring in both the table and the training sentences with a special token reflecting its table descriptor z w (Equation 3). The introduction section of the table in Figure 1 looks as follows under this scheme: "name 1 name 2 ( birthdate 1 birthdate 2 birthdate 3deathdate 1 deathdate 2 deathdate 3 ) was an english linguist , fields 3 pathologist , fields 10 scientist , mathematician , mystic and mycologist ." During inference, the decoder is constrained to emit words from the regular vocabulary or special tokens occurring in the input table. When picking a special token we copy the corresponding word from the table.

Training setup
For our neural models, we train 11-gram language models (n = 11) with a learning rate set to 0.0025.  Table 1: BLEU, ROUGE, NIST and perplexity without copy actions (first three rows) and with copy actions (last five rows). For neural models we report "mean + − standard deviation" for five training runs with different initialization. Decoding beam width is 5. Perplexities marked with and † are not directly comparable as the output vocabularies differ slightly.    Table 3 describes the other hyper-parameters. We include all fields occurring at least 100 times in the training data in F, the set of fields. We include the 20, 000 most frequent words in the vocabulary. The other hyperparameters are set through validation, maximizing BLEU over a validation subset of 1, 000 sentences. Similarly, early stopping is applied: training ends when BLEU stops improving on the same validation subset. One should note that the maximum number of tokens in a field l = 10 means that we encode only 10 positions: for longer field values the final tokens are not dropped but their position is capped to 10. We initialize the word embeddings W from Hellinger PCA computed over the set of training biographies. This representation has shown to be helpful for various applications (Lebret and Collobert, 2014).

Evaluation metrics
We use different metrics to evaluate our models. Performance is first evaluated in terms of perplexity which is the standard metric for language modeling. Generation quality is assessed automatically with BLEU-4, ROUGE-4 (F-measure) and NIST-4 3 (Belz and Reiter, 2006).

Results
This section describes our results and discusses the impact of the different conditioning variables.

The more, the better
The results (Table 1) show that more conditioning information helps to improve the performance of our models. The generation metrics BLEU, ROUGE and NIST all gives the same performance ordering over models. We first discuss models without copy actions (the first three results) and then discuss models with copy actions (the remaining results). Note that the factorization of our models results in three different output domains which makes perplexity comparisons less straightforward: models without copy actions operate over a fixed vocabulary. Template KN adds a fixed set of field/position pairs to this vocabulary while Table NLM models a variable set Q depending on the input table, see Section 3.3. Without copy actions. In terms of perplexity the (i) neural language model (NLM) is slightly better than an interpolated KN language model, and (ii) adding local conditioning on the field start and end position further improves accuracy. Generation metrics are generally very low but there is a clear improvement when using local conditioning since it allows to learn transitions between fields by linking previous predictions to the table unlike KN or plain NLM.
With copy actions. For experiments with copy actions we use the full local conditioning (Equation 4) in the neural language models. BLEU, ROUGE and NIST all improves when moving from Template KN to Table NLM and more features successively improve accuracy. Global conditioning on the fields improves the model by over 7 BLEU and adding words gives an additional 1.3 BLEU. This is a total improvement of nearly 15 BLEU over the Template Kneser-Ney baseline. Similar observations are made for ROUGE +15 and NIST +2.8.  (Table NLM) and the baseline (Template KN) for different beam sizes. The x-axis is the average timing (in milliseconds) for generating one sentence. The y-axis is the BLEU score. All results are measured on a subset of 1,000 samples of the validation set.

Attention mechanism
Our model implements attention over input table fields. For each word w in the table, Equation (23) takes the language model score φ W ct and adds a bias φ Q ct . The bias is the dot-product between a representation of the table field in which w occurs and a representation of the context, Equation (22) that summarizes the previously generated fields and words.  Each row represents the probability distribution over (field, position) pairs given the previous words (i.e. the words heading the preceding rows as well as the current row). Darker colors depict higher probabilities. Figure 4 shows that this mechanism adds a large bias to continue a field if it has not generated all tokens from the table, e.g., it emits the word occurring in name 2 after generating name 1. It also nicely handles transitions between field types, e.g., the model adds a large bias to the words occurring in the occupation field after emitting the birthdate.

Sentence decoding
We use a standard beam search to explore a larger set of sentences compared to simple greedy search. This allows us to explore K times more paths which comes at a linear increase in the number of forward computation steps for our language model. We compare various beam settings for the baseline Template KN and our Table NLM (Figure 3). The best validation BLEU can be obtained with a beam size of K = 5. Our model is also several times faster than the baseline, requiring only about 200 ms per sentence with K = 5. Beam search generates many ngram lookups for Kneser-Ney which requires many   random memory accesses; while neural models perform scoring through matrix-matrix products, an operation which is more local and can be performed in a block parallel manner where modern graphic processors shine (Kindratenko, 2014). Table 4 shows generations for different variants of our model based on the Wikipedia table in Figure 1. First of all, comparing the reference to the fact table reveals that our training data is not perfect. The birth month mentioned in the fact table and the first sentence of the Wikipedia article are different; this may have been introduced by one contributor editing the article and not keeping the information consistent.

Qualitative analysis
All three versions of our model correctly generate the beginning of the sentence by copying the name, the birth date and the death date from the table. The model correctly uses the past tense since the death date in the table indicates that the person has passed away. Frederick Parker-Rhodes was a scientist, but this occupation is not directly mentioned in the table. The model without global conditioning can therefore not predict the right occupation, and it continues the generation with the most common occupation (in Wikipedia) for a person who has died. In contrast, the global conditioning over the fields helps the model to understand that this person was indeed a scientist. However, it is only with the global conditioning on the words that the model can infer the correct occupation, i.e., computer scientist.

Conclusions
We have shown that our model can generate fluent descriptions of arbitrary people based on structured data. Local and global conditioning improves our model by a large margin and we outperform a Kneser-Ney language model by nearly 15 BLEU. Our task uses an order of magnitude more data than previous work and has a vocabulary that is three orders of magnitude larger.
In this paper, we have only focused on generating the first sentence and we will tackle the generation of longer biographies in future work. Also, the encoding of field values can be improved. Currently, we only attach the field type and token position to each word type and perform a max-pooling for local conditioning. One could leverage a richer representation by learning an encoder conditioned on the field type, e.g. a recurrent encoder or a convolutional encoder with different pooling strategies.
Furthermore, the current training loss function does not explicitly penalize the model for generating incorrect facts, e.g. predicting an incorrect nationality or occupation is currently not considered worse than choosing an incorrect determiner. A loss function that could assess factual accuracy would certainly improve sentence generation by avoiding such mistakes. Also it will be important to define a strategy for evaluating the factual accuracy of a generation, beyond BLEU, ROUGE or NIST.