Challenges in Data-to-Document Generation

Recent neural models have shown significant progress on the problem of generating short descriptive texts conditioned on a small number of database records. In this work, we suggest a slightly more difficult data-to-text generation task, and investigate how effective current approaches are on this task. In particular, we introduce a new, large-scale corpus of data records paired with descriptive documents, propose a series of extractive evaluation methods for analyzing performance, and obtain baseline results using current neural generation methods. Experiments show that these models produce fluent text, but fail to convincingly approximate human-generated documents. Moreover, even templated baselines exceed the performance of these neural models on some metrics, though copy- and reconstruction-based extensions lead to noticeable improvements.


Introduction
Over the past several years, neural text generation systems have shown impressive performance on tasks such as machine translation and summarization. As neural systems begin to move toward generating longer outputs in response to longer and more complicated inputs, however, the generated texts begin to display reference errors, intersentence incoherence, and a lack of fidelity to the source material. The goal of this paper is to suggest a particular, long-form generation task in which these challenges may be fruitfully explored, to provide a publically available dataset for this task, to suggest some automatic evaluation metrics, and finally to establish how current, neural text generation methods perform on this task.
A classic problem in natural-language generation (NLG) (Kukich, 1983;McKeown, 1992;Reiter and Dale, 1997) involves taking structured data, such as a table, as input, and producing text that adequately and fluently describes this data as output. Unlike machine translation, which aims for a complete transduction of the sentence to be translated, this form of NLG is typically taken to require addressing (at least) two separate challenges: what to say, the selection of an appropriate subset of the input data to discuss, and how to say it, the surface realization of a generation (Reiter and Dale, 1997;Jurafsky and Martin, 2014). Traditionally, these two challenges have been modularized and handled separately by generation systems. However, neural generation systems, which are typically trained end-to-end as conditional language models (Mikolov et al., 2010;Sutskever et al., 2011Sutskever et al., , 2014, blur this distinction. In this context, we believe the problem of generating multi-sentence summaries of tables or database records to be a reasonable next-problem for neural techniques to tackle as they begin to consider more difficult NLG tasks. In particular, we would like this generation task to have the following two properties: (1) it is relatively easy to obtain fairly clean summaries and their corresponding databases for dataset construction, and (2) the summaries should be primarily focused on conveying the information in the database. This latter property ensures that the task is somewhat congenial to a standard encoder-decoder approach, and, more importantly, that it is reasonable to evaluate generations in terms of their fidelity to the database.
One task that meets these criteria is that of generating summaries of sports games from associated box-score data, and there is indeed a long history of NLG work that generates sports game summaries (Robin, 1994;Tanaka-Ishii et al., 1998;Barzilay and Lapata, 2005). To this end, we make the following contributions: • We introduce a new large-scale corpus consisting of textual descriptions of basketball games paired with extensive statistical tables. This dataset is sufficiently large that fully data-driven approaches might be sufficient.
• We introduce a series of extractive evaluation models to automatically evaluate output generation performance, exploiting the fact that post-hoc information extraction is significantly easier than generation itself.
• We apply a series of state-of-the-art neural methods, as well as a simple templated generation system, to our data-to-document generation task in order to establish baselines and study their generations.
Our experiments indicate that neural systems are quite good at producing fluent outputs and generally score well on standard word-match metrics, but perform quite poorly at content selection and at capturing long-term structure. While the use of copy-based models and additional reconstruction terms in the training loss can lead to improvements in BLEU and in our proposed extractive evaluations, current models are still quite far from producing human-level output, and are significantly worse than templated systems in terms of content selection and realization. Overall, we believe this problem of data-to-document generation highlights important remaining challenges in neural generation systems, and the use of extractive evaluation reveals significant issues hidden by standard automatic metrics.

Data-to-Text Datasets
We consider the problem of generating descriptive text from database records. Following the notation in Liang et al. (2009), let s = {r j } J j=1 be a set of records, where for each r ∈ s we define r.t ∈ T to be the type of r, and we assume each r to be a binarized relation, where r.e and r.m are a record's entity and value, respectively. For example, a database recording statistics for a basketball game might have a record r such that r.t = POINTS, r.e = RUSSELL WESTBROOK, and r.m = 50. In this case, r.e gives the player in question, and r.m gives the number of points the player scored. From these records, we are interested in generating descriptive text,ŷ 1:T =ŷ 1 , . . . ,ŷ T of T words such thatŷ 1:T is an adequate and fluent summary of s. A dataset for training data-to-document systems typically consists of (s, y 1:T ) pairs, where y 1:T is a document consisting of a gold (i.e., human generated) summary for database s.
Several benchmark datasets have been used in recent years for the text generation task, the most popular of these being WEATHERGOV (Liang et al., 2009) andROBOCUP (Chen andMooney, 2008). Recently, neural generation systems have show strong results on these datasets, with the system of Mei et al. (2016) achieving BLEU scores in the 60s and 70s on WEATHERGOV, and BLEU scores of almost 30 even on the smaller ROBOCUP dataset. These results are quite promising, and suggest that neural models are a good fit for text generation. However, the statistics of these datasets, shown in Table 1, indicate that these datasets use relatively simple language and record structure. Furthermore, there is reason to believe that WEATHERGOV is at least partially machinegenerated (Reiter, 2017). More recently, Lebret et al. (2016) introduced the WIKIBIO dataset, which is at least an order of magnitude larger in terms of number of tokens and record types. However, as shown in Table 1, this dataset too only contains short (single-sentence) generations, and relatively few records per generation. As such, we believe that early success on these datasets is not yet sufficient for testing the desired linguistic capabilities of text generation at a document-scale.
With this challenge in mind, we introduce a new dataset for data-to-document text generation, available at https://github.com/ harvardnlp/boxscore-data. The dataset is intended to be comparable to WEATHERGOV in terms of token count, but to have significantly longer target texts, a larger vocabulary space, and to require more difficult content selection.
The dataset consists of two sources of articles summarizing NBA basketball games, paired with their corresponding box-and line-score tables. The data statistics of these two sources, RO-TOWIRE and SBNATION, are also shown in Table 1. The first dataset, ROTOWIRE, uses professionally written, medium length game summaries targeted at fantasy basketball fans. The writing is colloquial, but relatively well structured, and targets an audience primarily interested in game WIN  Atlanta was in desperate need of a win and they were able to take care of a shorthanded Miami team here . Defense was key for the Hawks , as they held the Heat to 42 percent shooting and forced them to commit 16 turnovers . Atlanta also dominated in the paint , winning the rebounding battle , 47 -34 , and outscoring them in the paint 58 -26.The Hawks shot 49 percent from the field and assisted on 27 of their 43 made baskets . This was a near wire -to -wire win for the Hawks , as Miami held just one lead in the first five minutes . Miami ( 7 -15 ) are as beat -up as anyone right now and it 's taking a toll on the heavily used starters . Hassan Whiteside really struggled in this game , as he amassed eight points , 12 rebounds and one blocks on 4 -of -12 shooting ... Figure 1: An example data-record and document pair from the ROTOWIRE dataset. We show a subset of the game's records (there are 628 in total), and a selection from the gold document. The document mentions only a select subset of the records, but may express them in a complicated manner. In addition to capturing the writing style, a generation system should select similar record content, express it clearly, and order it appropriately. statistics. The second dataset, SBNATION, uses fan-written summaries targeted at other fans. This dataset is significantly larger, but also much more challenging, as the language is very informal, and often tangential to the statistics themselves. We show some sample text from ROTOWIRE in Figure 1. Our primary focus will be on the RO-TOWIRE data.

Evaluating Document Generation
We begin by discussing the evaluation of generated documents, since both the task we introduce and the evaluation methods we propose are motivated by some of the shortcomings of current approaches to evaluation. Text generation systems are typically evaluated using a combination of automatic measures, such as BLEU (Papineni et al., 2002), and human evaluation. While BLEU is perhaps a reasonably effective way of evaluating short-form text generation, we found it to be unsatisfactory for document generation. In particular, we note that it primarily rewards fluent text generation, rather than generations that capture the most important information in the database, or that report the information in a particularly coherent way. While human evaluation, on the other hand, is likely ultimately necessary for evaluating generations (Liu et al., 2016;Wu et al., 2016), it is much less convenient than using automatic metrics. Furthermore, we believe that current text generations are sufficiently bad in sufficiently obvious ways that automatic metrics can still be of use in evaluation, and we are not yet at the point of needing to rely solely on human evaluators.

Extractive Evaluation
To address this evaluation challenge, we begin with the intuition that assessing document quality is easier than document generation. In particular, it is much easier to automatically extract information from documents than to generate documents that accurately convey desired information. As such, simple, high-precision information extraction models can serve as the basis for assessing and better understanding the quality of automatic generations. We emphasize that such an evaluation scheme is most appropriate when evaluating generations (such as basketball game summaries) that are primarily intended to summarize information. While many generation problems do not fall into this category, we believe this to be an interesting category, and one worth focusing on because it is amenable to this sort of evaluation.
To see how a simple information extraction system might work, consider the document in Figure 1. We may first extract candidate entity (player, team, and city) and value (number and certain string) pairs r.e, r.m that appear in the text, and then predict the type r.t (or none) of each candidate pair. For example, we might extract the entity-value pair ("Miami Heat", "95") from the first sentence in Figure 1, and then predict that the type of this pair is POINTS, giving us an extracted record r such that (r.e, r.m, r.t) = (MIAMI HEAT, 95, POINTS). Indeed, many relation extraction systems reduce relation extraction to multi-class classification precisely in this way (Zhang, 2004;Zhou et al., 2008;Zeng et al., 2014;dos Santos et al., 2015).
More concretely, given a documentŷ 1:T , we consider all pairs of word-spans in each sentence that represent possible entities e and values m. We then model p(r.t | e, m; θ) for each pair, using r.t = to indicate unrelated pairs. We use architectures similar to those discussed in Collobert et al. (2011) anddos Santos et al. (2015) to parameterize this probability; full details are given in the Appendix.
Importantly, we note that the (s, y 1:T ) pairs typically used for training data-to-document systems are also sufficient for training the information extraction model presented above, since we can obtain (partial) supervision by simply checking whether a candidate record lexically matches a record in s. 1 However, since there may be multiple records r ∈ s with the same e and m but with different types r.t, we will not always be able to determine the type of a given entity-value pair found in the text. We therefore train our classifier to minimize a latent-variable loss: for all document spans e and m, with observed types t(e, m) = {r.t : r ∈ s, r.e = e, r.m = m} (possibly { }), we minimize We find that this simple system trained in this way is quite accurate at predicting relations. On the 1 Alternative approaches explicitly align the document with the table for this task (Liang et al., 2009). ROTOWIRE data it achieves over 90% accuracy on held-out data, and recalls approximately 60% of the relations licensed by the records.

Comparing Generations
With a sufficiently precise relation extraction system, we can begin to evaluate how well an automatic generationŷ 1:T has captured the information in a set of records s. In particular, since the predictions of a precise information extraction system serve to align entity-mention pairs in the text with database records, this alignment can be used both to evaluate a generation's content selection ("what the generation says"), as well as content placement ("how the generation says it").
We consider in particular three induced metrics: • Content Selection (CS): precision and recall of unique relations r extracted from y 1:T that are also extracted from y 1:T . This measures how well the generated document matches the gold document in terms of selecting which records to generate.
• Relation Generation (RG): precision and number of unique relations r extracted from y 1:T that also appear in s. This measures how well the system is able to generate text containing factual (i.e., correct) records.
• Content Ordering (CO): normalized Damerau-Levenshtein Distance (Brill and Moore, 2000) 2 between the sequences of records extracted from y 1:T and that extracted fromŷ 1:T . This measures how well the system orders the records it chooses to discuss.
We note that CS primarily targets the "what to say" aspect of evaluation, CO targets the "how to say it" aspect, and RG targets both.
We conclude this section by contrasting the automatic evaluation we have proposed with recently proposed adversarial evaluation approaches, which also advocate automatic metrics backed by classification (Bowman et al., 2016;Kannan and Vinyals, 2016;. Unlike adversarial evaluation, which uses a blackbox classifier to determine the quality of a generation, our metrics are defined with respect to the predictions of an information extraction system. Accordingly, our metrics are quite interpretable, since by construction it is always possible to determine which fact (i.e., entity-value pair) in the generation is determined by the extractor to not match the database or the gold generation.

Neural Data-to-Document Models
In this section we briefly describe the neural generation methods we apply to the proposed task. As a base model we utilize the now standard attentionbased encoder-decoder model (Sutskever et al., 2014;Cho et al., 2014;Bahdanau et al., 2015). We also experiment with several recent extensions to this model, including copy-based generation, and training with a source reconstruction term in the loss (in addition to the standard per-target-word loss).
Base Model For our base model, we map each record r ∈ s into a vectorr by first embedding r.t (e.g., POINTS), r.e (e.g., RUSSELL WESTBROOK), and r.m (e.g., 50), and then applying a 1-layer MLP (similar to Yang et al. (2016)). 3 Our source data-records are then represented ass = {r j } J j=1 . Givens, we use an LSTM decoder with attention and input-feeding, in the style of Luong et al. (2015), to compute the probability of each target word, conditioned on the previous words and on s. The model is trained end-to-end to minimize the negative log-likelihood of the words in the gold text y 1:T given corresponding source material s.
Copying There has been a surge of recent work involving augmenting encoder-decoder models to copy words directly from the source material on which they condition (Gu et al., 2016;Gülçehre et al., 2016;Merity et al., 2016;Jia and Liang, 2016;Yang et al., 2016). These models typically introduce an additional binary variable z t into the per-timestep target word distribution, which indicates whether the target wordŷ t is copied from the source or generated: p(ŷ t , z t = z |ŷ 1:t−1 , s).
In our case, we assume that target words are copied from the value portion of a record r; that is, a copy impliesŷ t = r.m for some r and t.

Joint Copy Model
The models of Gu et al. (2016) and Yang et al. (2016) parameterize the joint distribution table overŷ t and z t directly: where copy and gen are functions parameterized in terms of the decoder RNN's hidden state that assign scores to words, and where the notationŷ t ∈ s indicates thatŷ t is equal to r.m for some r ∈ s.
Conditional Copy Model Gülçehre et al. (2016), on the other hand, decompose the joint probability as: where an MLP is used to model p(z t |ŷ 1:t−1 , s). Models with copy-decoders may be trained to minimize the negative log marginal probability, marginalizing out the latent-variable z t (Gu et al., 2016;Yang et al., 2016;Merity et al., 2016). However, if it is known which target words y t are copied, it is possible to train with a loss that does not marginalize out the latent z t . Gülçehre et al. (2016), for instance, assume that any target word y t that also appears in the source is copied, and train to minimize the negative joint log-likelihood of the y t and z t .
In applying such a loss in our case, we again note that there may be multiple records r such that r.m appears inŷ 1:T . Accordingly, we slightly modify the p copy portion of the loss of Gülçehre et al. (2016) to sum over all matched records. In particular, we model the probability of relations r ∈ s such that r.m = y t and r.e is in the same sentence as r.m. Letting r(y t ) = {r ∈ s : r.m = y t , same−sentence(r.e, r.m)}, we have: p copy (y t | z t , y 1:t−1 , s) = r∈r(yt) p(r | z t , y 1:t−1 , s).
We note here that the key distinction for our purposes between the Joint Copy model and the Conditional Copy model is that the latter conditions on whether there is a copy or not, and so in p copy the source records compete only with each other. In the Joint Copy model, however, the source records also compete with words that cannot be copied. As a result, training the Conditional Copy model with the supervised loss of Gülçehre et al. (2016) can be seen as training with a word-level reconstruction loss, where the decoder is trained to choose the record in s that gives rise to y t .
Reconstruction Losses Reconstruction-based techniques can also be applied at the documentor sentence-level during training. One simple approach to this problem is to utilize the hidden states of the decoder to try to reconstruct the database. A fully differentiable approach using the decoder hidden states has recently been successfully applied to neural machine translation by Tu et al. (2017). Unlike copying, this method is applied only at training, and attempts to learn decoder hidden states with broader coverage of the input data.
In adopting this reconstruction approach we segment the decoder hidden states h t into T B contiguous blocks of size at most B. Denoting a single one of these hidden state blocks as b i , we attempt to predict each field value in some record r ∈ s from b i . We define p(r.e, r.m | b i ), the probability of the entity and value in record r given b i , to be softmax(f (b i )), where f is a parameterized function of b i , which in our experiments utilize a convolutional layer followed by an MLP; full details are given in the Appendix. We further extend this idea and predict K records in s from b i , rather than one. We can train with the following reconstruction loss for a particular b i : where p k is the k'th predicted distribution over records, and where we have modeled each component of r independently. This loss attempts to make the most probable record in s given b i more probable. We found that augmenting the above loss with a term that penalizes the total variation distance (TVD) between the p k to be helpful. 4 Both L(θ) and the TVD term are simply added to the standard negative log-likelihood objective at training time.

Experimental Methods
In this section we highlight a few important details of our models and methods; full details are in the Appendix. For our ROTOWIRE models, the record encoder producesr j in R 600 , and we use a 2-layer LSTM decoder with hidden states of the same size as ther j , and dot-product attention and input-feeding in the style of Luong et al. (2015).
Unlike past work, we use two identically structured attention layers, one to compute the standard generation probabilities (gen or p gen ), and one to produce the scores used in copy or p copy . We train the generation models using SGD and truncated BPTT (Elman, 1990;Mikolov et al., 2010), as in language modeling. That is, we split each y 1:T into contiguous blocks of length 100, and backprop both the gradients with respect to the current block as well as with respect to the encoder parameters for each block.
Our extractive evaluator consists of an ensemble of 3 single-layer convolutional and 3 singlelayer bidirectional LSTM models. The convolutional models concatenate convolutions with kernel widths 2, 3, and 5, and 200 feature maps in the style of (Kim, 2014). Both models are trained with SGD.
Then, 6 player-specific sentences of the following form are emitted (again adapting a simple sentence from the training set): <player> scored <pts> points (<fgm>-<fga> FG, <tpm>-<tpa> 3PT, <ftm>-<fta> FT) to go with <reb> rebounds. encouraging, rather than penalizing the TVD between the p k , which might make sense if we were worried about ensuring the p k captured different records.
The 6 highest-scoring players in the game are used to fill in the above template. Finally, a typical end sentence is emitted: The <team1>' next game will be at home against the Dallas Mavericks, while the <team2> will travel to play the Bulls.

Results
We found that all models performed quite poorly on the SBNATION data, with the best model achieving a validation perplexity of 33.34 and a BLEU score of 1.78. This poor performance is presumably attributable to the noisy quality of the SBNATION data, and the fact that many documents in the dataset focus on information not in the box-and line-scores. Accordingly, we focus on ROTOWIRE in what follows.
The main results for the ROTOWIRE dataset are shown in Table 2, which shows the performance of the models in Section 4 in terms of the metrics defined in Section 3.2, as well as in terms of perplexity and BLEU.

Discussion
There are several interesting relationships in the development portion of Table 2. First we note that the Template model scores very poorly on BLEU, but does quite well on the extractive metrics, providing an upper-bound for how domain knowledge could help content selection and generation. All the neural models make significant improvements terms of BLEU score, with the conditional copying with beam search performing the best, even though all the neural models achieve roughly the same perplexity.
The extractive metrics provide further insight into the behavior of the models. We first note that on the gold documents y 1:T , the extractive model reaches 92% precision. Using the Joint Copy model, generation only has a record generation (RG) precision of 47% indicating that relationships are often generated incorrectly. The best Conditional Copy system improves this value to 71%, a significant improvement and potentially the cause of the improved BLEU score, but still far below gold.
The Utah Jazz ( 38 -26 ) defeated the Houston Rockets ( 38 -26 ) 117 -91 on Wednesday at Energy Solutions Arena in Salt Lake City . The Jazz got out to a quick start in this one , out -scoring the Rockets 31 -15 in the first quarter alone . Along with the quick start , the Rockets were the superior shooters in this game , going 54 percent from the field and 43 percent from the three -point line , while the Jazz went 38 percent from the floor and a meager 19 percent from deep . The Rockets were able to out -rebound the Rockets 49 -49 , giving them just enough of an advantage to secure the victory in front of their home crowd . The Jazz were led by the duo of Derrick Favors and James Harden . Favors went 2 -for -6 from the field and 0 -for -1 from the three -point line to score a game -high of 15 points , while also adding four rebounds and four assists .... Figure 2: Example document generated by the Conditional Copy system with a beam of size 5. Text that accurately reflects a record in the associated box-or line-score is highlighted in blue, and erroneous text is highlighted in red.
Notably, content selection (CS) and content ordering (CO) seem to have no correlation at all with BLEU. There is some improvement with CS for the conditional model or reconstruction loss, but not much change as we move to beam search. CO actually gets worse as beam search is utilized, possibly a side effect of generating more records (RG#). The fact that these scores are much worse than the simple templated model indicates that further research is needed into better copying alone for content selection and better long term content ordering models.
Test results are consistent with development results, indicating that the Conditional Copy model is most effective at BLEU, RG, and CS, and that reconstruction is quite helpful for improving the joint model.

Human Evaluation
We also undertook two human evaluation studies, using Amazon Mechanical Turk. The first study attempted to determine whether generations considered to be more precise by our metrics were also considered more precise by human raters. To accomplish this, raters were presented with a particular NBA game's box score and line score, as well as with (randomly selected) sentences from summaries generated by our different models for those games. Raters were then asked to count how many facts in each sentence were supported by records in the box or line scores, and how many were contradicted. We randomly selected 20 distinct games to present to raters, and a total of 20 generated sentences per game were evaluated by raters. The left two columns of Table 3   average numbers of supporting and contradicting facts per sentence as determined by the raters, for each model. We see that these results are generally in line with the RG and CS metrics, with the Conditional Copy model having the highest number of supporting facts, and the reconstruction terms significantly improving the Joint Copy models.
Using a Tukey HSD post-hoc analysis of an ANOVA with the number of contradicting facts as the dependent variable and the generating model and rater id as independent variables, we found significant (p < 0.01) pairwise differences in contradictory facts between the gold generations and all models except "Copy+Rec+TVD," as well as a significant difference between "Copy+Rec+TVD" and "Copy". We similarly found a significant pairwise difference between "Copy+Rec+TVD" and "Copy" for number of supporting facts.
Our second study attempted to determine whether generated summaries differed in terms of how natural their ordering of records (as captured, for instance, by the DLD metric) is. To test this, we presented raters with random summaries generated by our models and asked them to rate the naturalness of the ordering of facts in the summaries on a 1-7 Likert scale. 30 random summaries were used in this experiment, each rated 3 times by distinct raters. The average Likert ratings are shown in the rightmost column of Table 3  While it is encouraging that the gold summaries received a higher average score than the generated summaries (and that the reconstruction term again improved the Joint Copy model), a Tukey HSD analysis similar to the one presented above revealed no significant pairwise differences. Figure 2 shows a document generated by the Conditional Copy model, using a beam of size 5. This particular generation evidently has several nice properties: it nicely learns the colloquial style of the text, correctly using idioms such as "19 percent from deep." It is also partially accurate in its use of the records; we highlight in blue when it generates text that is licensed by a record in the associated box-and line-scores. At the same time, the generation also contains major logical errors. First, there are basic copying mistakes, such as flipping the teams' win/loss records. The system also makes obvious semantic errors; for instance, it generates the phrase "the Rockets were able to out-rebound the Rockets." Finally, we see the model hallucinates factual statements, such as "in front of their home crowd," which is presumably likely according to the language model, but ultimately incorrect (and not supported by anything in the box-or linescores). In practice, our proposed extractive evaluation will pick up on many errors in this passage. For instance, "four assists" is an RG error, repeating the Rockets' rebounds could manifest in a lower CO score, and incorrectly indicating the win/loss records is a CS error.

Related Work
In this section we note additional related work not noted throughout. Natural language generation has been studied for decades (Kukich, 1983;McKeown, 1992;Reiter and Dale, 1997), and generating summaries of sports games has been a topic of interest for almost as long (Robin, 1994;Tanaka-Ishii et al., 1998;Barzilay and Lapata, 2005).
Within the world of neural text generation, some recent work has focused on conditioning language models on tables (Yang et al., 2016), and generating short biographies from Wikipedia Tables (Lebret et al., 2016;Chisholm et al., 2017). Mei et al. (2016) use a neural encoderdecoder approach on standard record-based generation datasets, obtaining impressive results, and motivating the need for more challenging NLG problems.

Conclusion and Future Work
This work explores the challenges facing neural data-to-document generation by introducing a new dataset, and proposing various metrics for automatically evaluating content selection, generation, and ordering. We see that recent ideas in copying and reconstruction lead to improvements on this task, but that there is a significant gap even between these neural models and templated systems. We hope to motivate researchers to focus further on generation problems that are relevant both to content selection and surface realization, but may not be reflected clearly in the model's perplexity. Future work on this task might include approaches that process or attend to the source records in a more sophisticated way, generation models that attempt to incorporate semantic or reference-related constraints, and approaches to conditioning on facts or records that are not as explicit in the box-and line-scores.