Revisiting Challenges in Data-to-Text Generation with Fact Grounding

Data-to-text generation models face challenges in ensuring data fidelity by referring to the correct input source. To inspire studies in this area, Wiseman et al. (2017) introduced the RotoWire corpus on generating NBA game summaries from the box- and line-score tables. However, limited attempts have been made in this direction and the challenges remain. We observe a prominent bottleneck in the corpus where only about 60% of the summary contents can be grounded to the boxscore records. Such information deficiency tends to misguide a conditioned language model to produce unconditioned random facts and thus leads to factual hallucinations. In this work, we restore the information balance and revamp this task to focus on fact-grounded data-to-text generation. We introduce a purified and larger-scale dataset, RotoWire-FG (Fact-Grounding), with 50% more data from the years 2017-19 and enriched input tables, and hope to attract more research in this direction. Moreover, we achieve improved data fidelity over the state-of-the-art models by integrating a new form of table reconstruction as an auxiliary task to boost the generation quality.


Introduction
Data-to-text generation aims at automatically producing descriptive natural language texts to convey the messages embodied in structured data formats, such as database records (Chisholm et al., 2017), knowledge graphs (Gardent et al., 2017a), and tables (Lebret et al., 2016; Wiseman et al., 2017). Table 1 shows an example from the RotoWire (RW) corpus illustrating the task of generating document-level NBA basketball game summaries from the large box- and line-score tables. It poses great challenges, requiring capabilities to select what to say (content selection) at two levels, namely which entity and which attribute, and to determine how to say it at both the discourse (content planning) and token (surface realization) levels.
Although this excellent resource has received great research attention, very few works (Li and Wan, 2018; Puduppully et al., 2019a,b; Iso et al., 2019) have attempted to tackle the challenges of ensuring data fidelity. This prompted us to investigate the reason, and we identify a major culprit undermining researchers' interest: the ungrounded contents in the human-written summaries impede a model from learning to generate accurate fact-grounded statements and lead to possibly misleading evaluation results when models are compared against each other.
Specifically, we observe that about 40% of the game summary contents cannot be directly mapped to any input boxscore records, as exemplified by Table 1. Written by professional sports journalists, these statements incorporate domain expertise and background knowledge consolidated from heterogeneous sources that are often hard to trace. The resulting information imbalance hinders a model from producing texts fully conditioned on the inputs, and the uncontrolled randomness causes factual hallucinations, especially for the modern encoder-decoder framework (Sutskever et al., 2014; Cho et al., 2014). However, besides fluency, data fidelity is crucial for data-to-text generation. In this real-world application, mistaken statements are detrimental to the document quality no matter how human-like they appear to be.
Apart from the popular BLEU (Papineni et al., 2002) metric for text generation, Wiseman et al. (2017) also formalized a set of post-hoc information extraction (IE) based evaluations to assess data fidelity. Using the boxscore table schema, a sequence of (entity, value, type) records mentioned in a system-generated summary is extracted as the content plan. It is then validated for accuracy against the boxscore table and for similarity with the one extracted from the human-written summary. However, hallucinated facts may unrealistically boost the BLEU score while going unpenalized by the data fidelity metrics, since no records can be identified from the ungrounded contents. Thus the possibly misleading evaluation results prevent systems from demonstrating excellence on this task.
Example from Table 1, a hallucinated statement: After going into halftime down by eight, the Rockets came out firing in the third quarter and out-scored the Nuggets 59-42 to seal the victory on the road.
Example from Table 1, a game summary: The Houston Rockets (18-5) defeated the Denver Nuggets (10-13) 108-96 on Saturday. Houston has won 2 straight games and 6 of their last 7. Dwight Howard returned to action Saturday after missing the Rockets' last 11 games with a knee injury. He was supposed to be limited to 24 minutes in the game, but Dwight Howard persevered to play 30 minutes and put up a monstrous double-double of 26 points and 13 rebounds. Joining Dwight Howard in on the fun was James Harden with a triple-double of 24 points, 10 rebounds and 10 assists in 38 minutes. The Rockets' formidable defense held the Nuggets to just 38 percent shooting from the field. Houston will face the Nuggets again in their next game, going on the road to Denver for their game on Wednesday. Denver has lost 4 of their last 5 games as they struggle to find footing during a tough part of their schedule ... Denver will begin a 4-game homestead hosting the San Antonio Spurs on Sunday.
These two aspects potentially undermine people's interest in this data-fidelity-oriented table-to-text generation task. Therefore, in this work, we revamp the task to emphasize this core aspect and further enable research in this direction. First, we restore the information balance by trimming ungrounded contents from the summaries and replenishing the boxscore table to compensate for missing inputs. This requires the non-trivial extraction of high-quality latent gold-standard content plans. We therefore design sophisticated heuristics and achieve an estimated 98% precision and 95% recall of the true content plans, retaining 74% of the numerical words in the summaries. This yields better content plans than the 94% precision and 80% recall of Puduppully et al. (2019b) and the 60% retention of Wiseman et al. (2017). Guided by the high-quality content plans, only fact-grounded contents are identified and retained, as shown in Table 1.
Furthermore, by expanding the corpus with 50% more games from the years 2017-19, we obtain the more focused RotoWire-FG (RW-FG) dataset.
This leads to more accurate evaluations and collectively paves the way for future work by providing a more user-friendly alternative. With this refurbished setup, the existing models are then re-assessed on their abilities to ensure data fidelity. We discover that by only purifying the RW dataset, the models can generate more precise facts without sacrificing fluency. Furthermore, we propose a new form of table reconstruction as an auxiliary task to improve fact grounding. By incorporating it into the state-of-the-art Neural Content Planning (NCP) (Puduppully et al., 2019a) model, we establish a benchmark on the RW-FG dataset with a 24.41 BLEU score and 95.7% factual accuracy.
Finally, these insights lead us to summarize several fine-grained future challenges based on concrete examples, regarding factual accuracy and intra- and inter-sentence coherence.
Our contributions include:
1. We introduce a purified, enlarged and enriched new dataset to support the more focused fact-grounded table-to-text generation task. We provide high-quality mappings from summary facts to table records (content plans) and a more user-friendly experimental setup. All code and data are freely available.
2. We re-investigate existing methods with more insights, establish a new benchmark on this task, and uncover more fine-grained challenges to encourage future research.

Data-to-Text Dataset
This task requires models to take as input the NBA basketball game boxscore tables containing hundreds of records and generate the corresponding game summaries. A table can be viewed as a set of (entity, value, type) records, where entity is the row name and type is the column name in Table 1. Consider the set of entities for a game, and let $S = \{r_j\}_{j=1}^{S}$ be the set of records, where each $r_j$ has a value $r_j^m$, an entity name $r_j^e$, a record type $r_j^t$, and $r_j^h$ indicating whether the entity is on the HOME or AWAY team. For example, a record has $r_j^t$ = POINTS, $r_j^e$ = Dwight Howard, $r_j^m$ = 26, and $r_j^h$ = HOME. The summary has $T$ words: $\hat{y}_{1:T} = \hat{y}_1, \ldots, \hat{y}_T$. A sample is a $(S, \hat{y}_{1:T})$ pair.
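To make the record view of a table concrete, the following is a minimal sketch of one possible in-memory representation. The class name `Record`, the helper `table_to_records`, and type names other than POINTS (for instance `REBOUNDS`) are illustrative assumptions, not identifiers from the released dataset.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    """One (entity, value, type) record plus the HOME/AWAY flag."""
    entity: str  # r^e: row name (a player or a team)
    value: str   # r^m: cell value
    rtype: str   # r^t: column name
    side: str    # r^h: "HOME" or "AWAY"

def table_to_records(rows, side_of):
    """Flatten a {entity: {type: value}} table into a list of records."""
    records = []
    for entity, cells in rows.items():
        for rtype, value in cells.items():
            records.append(Record(entity, str(value), rtype, side_of[entity]))
    return records

# Toy table built from the Dwight Howard example in the text.
# The column names other than POINTS are hypothetical.
boxscore = {
    "Dwight Howard": {"POINTS": 26, "REBOUNDS": 13},
    "James Harden": {"POINTS": 24, "ASSISTS": 10},
}
side_of = {"Dwight Howard": "HOME", "James Harden": "HOME"}

records = table_to_records(boxscore, side_of)
summary = "Dwight Howard put up 26 points and 13 rebounds .".split()
sample = (records, summary)  # a (S, y_hat_{1:T}) pair
```

Keeping the record immutable (`frozen=True`) makes it hashable, so records can later be placed in sets when content plans are compared.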

Looking into the RotoWire Corpus
To better understand what kinds of ungrounded contents are causing the interference, we manually examine a set of 30 randomly picked samples and categorize the sentences into 5 types, whose counts and percentages are tabulated in Table 2. The His type occupies the majority portion, followed by the game-specific Game, Inf, and Agg types, and the remainder goes to Sch. Specifically, the His and Agg types come from an exponentially large number of possible combinations of game statistics, and the Inf type is based on subjective judgments. Thus, it is difficult to trace and aggregate the heterogeneous sources of such statements to fully balance the input and output. The Sch and Game types require sampling from a large pool of non-numerical and time-related information, whose exclusion would not affect the nature of the fact-grounded generation task. On the other hand, these ungrounded contents misguide a system into generating hallucinated facts and thus defeat the purpose of developing and evaluating models for fact-grounded table-to-text generation. Thus, we emphasize this core aspect of the task by trimming contents not licensed by the boxscore table, which, as we show later, still encompasses many fine-grained challenges waiting to be resolved. While fully restoring all desired inputs is also an interesting research challenge, it is orthogonal to our focus and thus left for future exploration.

RotoWire-FG
Motivated by these observations, we perform purification and augmentation on the original dataset to obtain the new RW-FG dataset.

Dataset Purification
Purifying Contents: We aim to retain game summary contents with facts licensed by the boxscore records. The sports game summary genre is more descriptive than analytical and aims to concisely cover salient player or team statistics. Correspondingly, a summary often finishes describing one entity before shifting to the next. This fashion of topic shift allows us to identify the topic boundaries using sentences as units, and thus greatly narrows down the candidate boxscore records to be aligned with a fact. The mappings can then be identified using simple pattern-based matching, as also explored by Wiseman et al. (2017). It also enables resolving co-reference by mapping singular and plural pronouns to the most recently mentioned players and teams respectively. A numerical value associated with an entity is licensed by the boxscore table if it equals the record value of the desired type. Thus we design a set of heuristics to determine the types, such as mapping "Channing Frye furnished 12 points" to the (Channing Frye, 12, POINTS) record in the table. Finally, consecutive sentences describing the same entity are retained if any numerical value is licensed by the boxscore table.
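The purification idea above can be illustrated with a heavily simplified sketch. The keyword map, the function names, and the per-sentence retention rule are assumptions made for illustration; the actual heuristics cover many more patterns and operate over runs of consecutive sentences about the same entity.

```python
import re

# Hypothetical keyword-to-type map; real heuristics would be far richer.
TYPE_KEYWORDS = {"points": "POINTS", "rebounds": "REBOUNDS", "assists": "ASSISTS"}

def align_sentence(sentence, entity, table):
    """Return the (entity, value, type) records in `sentence` licensed by `table`.

    `table` maps (entity, type) to an integer value. A number is licensed only
    if it equals the table value of the type suggested by the following word.
    """
    plan = []
    for number, word in re.findall(r"(\d+)\s+(\w+)", sentence):
        rtype = TYPE_KEYWORDS.get(word.lower())
        if rtype is not None and table.get((entity, rtype)) == int(number):
            plan.append((entity, int(number), rtype))
    return plan

def purify(sentences, entities, table):
    """Keep sentences in which the current topic entity has a licensed number."""
    kept, plan, topic = [], [], None
    for sentence in sentences:
        # Topic shift: a newly mentioned entity becomes the topic. A sentence
        # without an explicit mention (e.g. one using "he") keeps the old topic.
        mentioned = [e for e in entities if e in sentence]
        if mentioned:
            topic = mentioned[0]
        licensed = align_sentence(sentence, topic, table) if topic else []
        if licensed:
            kept.append(sentence)
            plan.extend(licensed)
    return kept, plan

table = {("Channing Frye", "POINTS"): 12, ("Channing Frye", "REBOUNDS"): 2}
sentences = [
    "Channing Frye furnished 12 points .",
    "He also grabbed 2 rebounds .",
    "Cleveland has won 4 straight games .",
]
kept, plan = purify(sentences, ["Channing Frye"], table)
```

In this toy run the third sentence is dropped because its only number cannot be matched to any record, while the pronoun sentence is kept by inheriting the most recent topic entity.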
This trimming process has negligible influence on the inter-sentence coherence of the summaries. We achieve a 98% precision and a 95% recall of the true content plans and align 74% of all numerical words in the summaries to records in the boxscore tables. The sequence of mapped records is extracted as the content plan, and samples describing fewer than 5 records are discarded.
Situated between labor-intensive yet imperfect manual annotation and cheap but inaccurate lexical matching, our heuristics achieve better quality with an effort similar to that of training and assembling the IE models of Wiseman et al. (2017). Meanwhile, more accurate content plans provide better reliability during evaluation.
Normalization: To enhance accuracy, we convert all English number words into numerical values. As some percentages are rounded differently between the summaries and the boxscore tables, such discrepancies are rectified. We also perform entity normalization for players and teams, resolving mentions of the same entity to one lexical form. This makes evaluations more user-friendly and less prone to errors.
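As an illustration of the number-word conversion step, here is a small sketch that handles English number words below one hundred. The function name and the coverage are assumptions; a real pipeline would also need to decide when a word such as "one" is a pronoun rather than a number.

```python
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}

def _small(word):
    """Return the value of a unit word in 1..9, or None."""
    value = UNITS.get(word)
    return value if value is not None and 1 <= value <= 9 else None

def normalize_numbers(tokens):
    """Replace number words (0-99) in a token list with digit strings."""
    out, i = [], 0
    while i < len(tokens):
        word = tokens[i].lower()
        head, _, tail = word.partition("-")
        if word in UNITS:
            out.append(str(UNITS[word]))
        elif word in TENS:
            # Absorb a following unit word, e.g. "twenty" "six" becomes 26.
            nxt = _small(tokens[i + 1].lower()) if i + 1 < len(tokens) else None
            if nxt is not None:
                out.append(str(TENS[word] + nxt))
                i += 1
            else:
                out.append(str(TENS[word]))
        elif head in TENS and _small(tail) is not None:
            # Hyphenated form, e.g. "twenty-one" becomes 21.
            out.append(str(TENS[head] + _small(tail)))
        else:
            out.append(tokens[i])
        i += 1
    return out

tokens = "He scored twenty six points , twenty-one rebounds and two steals".split()
normalized = normalize_numbers(tokens)
```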

Dataset Augmentation
Enlargement: Similar to Wiseman et al. (2017), we crawl the game summaries from the RotoWire Game Recaps between the years 2017-19 and align the summaries with the official NBA boxscore tables. This brings 2.6K more games with 56% more tokens, as tabulated in Table 4.
Line-score replenishment: Many team statistics in the summaries are missing from the line-score tables. We recover them by aggregating other boxscore statistics. For example, the numbers of shots attempted and made by a team for field goals, 3-pointers, and free throws are calculated by summing the corresponding player statistics. Besides, we supplement a set of team point breakdowns, as shown in the accompanying table.
Table caption (beginning lost in extraction): ... (green) to the source of statistics in the column names (yellow). "Sums" operates on individual teams and "Diffs" is between the two teams. For example, the "1 to 2" cell in the second row means the summation of points scored by a team in the 1st and 2nd "Quarters", and the "1st" cell in the fourth row means the difference between the two teams' 1st half points.
More data collection details are included in Appendix A.
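The line-score replenishment described above amounts to simple aggregation. The sketch below shows one way to compute team totals from player rows and the "Sums" and "Diffs" style point breakdowns; the field names (`TEAM`, `FGM`, `FGA`), the function names, and all numbers are made up for illustration.

```python
def team_totals(player_rows, stat_names):
    """Sum the given player statistics per team."""
    totals = {}
    for row in player_rows:
        team_total = totals.setdefault(row["TEAM"], {name: 0 for name in stat_names})
        for name in stat_names:
            team_total[name] += row[name]
    return totals

def quarter_range_sum(quarter_points, first, last):
    """Points a team scored from quarter `first` to `last` (1-based, inclusive)."""
    return sum(quarter_points[first - 1:last])

def half_diff(team_quarters, opponent_quarters, half):
    """Difference between the two teams' points in the given half (1 or 2)."""
    first, last = (1, 2) if half == 1 else (3, 4)
    return (quarter_range_sum(team_quarters, first, last)
            - quarter_range_sum(opponent_quarters, first, last))

# Hypothetical player rows and per-quarter points.
players = [
    {"TEAM": "Rockets", "FGM": 9, "FGA": 14},
    {"TEAM": "Rockets", "FGM": 7, "FGA": 18},
    {"TEAM": "Nuggets", "FGM": 5, "FGA": 12},
]
totals = team_totals(players, ["FGM", "FGA"])

rockets_quarters = [30, 25, 20, 33]
nuggets_quarters = [28, 22, 26, 20]
first_half_points = quarter_range_sum(rockets_quarters, 1, 2)        # a "1 to 2" sum
first_half_margin = half_diff(rockets_quarters, nuggets_quarters, 1)  # a "1st" half diff
```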

Re-assessing Models on Purified RW

Models
We re-assess three neural network based models on this task. To feed the tables to the models, each record $r_j$ has attribute embeddings for $r_j^m$, $r_j^e$, $r_j^t$, and $r_j^h$, and their concatenation is the input.
• NCP (Puduppully et al., 2019a): The Neural Content Planning (NCP) model employs a pointer network (Vinyals et al., 2015) to select a subset of records from the boxscore table and sequentially roll them out as the content plan. The summary is then generated only from the content plan using the ED-CC model with a Bi-LSTM encoder.
• ENT (Puduppully et al., 2019b): The ENTity memory network (ENT) model extends the ED-CC model with a dynamically updated entity-specific memory module to capture topic shifts in outputs and incorporates it into each decoder step with a hierarchical attention mechanism.

Evaluation
In addition to using BLEU (Papineni et al., 2002) as a reasonable proxy for evaluating the fluency of the generated summary, Wiseman et al. (2017) designed three types of metrics to assess if a summary accurately conveys the desired information.
Extractive Metrics: First, an ordered sequence of (entity, value, type) triples is extracted from the system output summary as the content plan, using the same heuristics as in section 2.2.1. It is then checked against the table for its accuracy (RG) and against the gold content plan to measure how well they match (CS & CO). Specifically, let $cp = \{r_i\}$ and $\hat{cp} = \{\hat{r}_i\}$ be the gold and system content plans respectively, and let $|\cdot|$ denote set cardinality. We calculate the following measures:
• Content Selection (CS): precision (CSP) and recall (CSR) of $\hat{cp}$ with respect to $cp$.
• Relation Generation (RG): precision (RGP) of $\hat{cp}$ with respect to the table.
• Content Ordering (CO): DLD, the normalized Damerau-Levenshtein Distance (Brill and Moore, 2000) between $cp$ and $\hat{cp}$.
CS and RG measure the "what to say" aspect and CO measures the "how to say" aspect.
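The extractive measures can be sketched as follows. This is an illustration under stated assumptions rather than the official evaluation script: content plans are treated as sets for CS, the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance is used, and CO is reported as one minus the distance normalized by the longer plan.

```python
def rg_precision(system_plan, table_records):
    """Fraction of system records that are supported by the table."""
    if not system_plan:
        return 0.0
    table = set(table_records)
    return sum(record in table for record in system_plan) / len(system_plan)

def cs_precision_recall(gold_plan, system_plan):
    """Set-based precision and recall of the system plan against the gold plan."""
    gold, system = set(gold_plan), set(system_plan)
    overlap = len(gold & system)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall

def dld(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]

def co_score(gold_plan, system_plan):
    """Ordering similarity: 1 minus the length-normalized distance (an assumption)."""
    longest = max(len(gold_plan), len(system_plan))
    return 1.0 if longest == 0 else 1.0 - dld(gold_plan, system_plan) / longest

# Hypothetical records for illustration.
A = ("LeBron James", 25, "POINTS")
B = ("Kevin Love", 20, "POINTS")
C = ("Channing Frye", 12, "POINTS")
D = ("Kyrie Irving", 99, "POINTS")  # not in the table
gold_plan, system_plan = [A, B, C], [B, A, D]
rgp = rg_precision(system_plan, [A, B, C])
csp, csr = cs_precision_recall(gold_plan, system_plan)
co = co_score(gold_plan, system_plan)
```

Here the system plan swaps two records and invents one, so it loses precision on RG and CS and pays one transposition plus one substitution on CO.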

Experiments
Setup: To re-investigate the three existing methods on their ability to convey accurate information conditioned on the input, we assess them by training on the purified RW corpus. To demonstrate the differences brought by the purification process, we keep all other settings unchanged and report results on the original validation and test sets after performing early stopping (Yao et al., 2007) based on the BLEU score.
Results: As shown in Table 6, we observe an increase in Relation Generation Precision (RGP) and on-par performance for Content Selection (CS) and Content Ordering (CO). In particular, RGP increases substantially, by 2.7% on average across all models. The CS and CO measures fluctuate above and below the references, with the biggest disparity in Content Selection Precision (CSP), Content Selection Recall (CSR), and CO for the ENT model. Since output length is a main independent variable for this set of experiments and a crucial factor in the BLEU score as well, we report the breakdowns in Table 7. Specifically, the NCP model shows consistent improvements on all BLEU 1-4 scores, as does ENT on the validation set. Amid the fluctuations around the references, nearly all models demonstrate an increase in BLEU-1 and BLEU-4 precision. As reflected in the brevity penalty (BP) coefficients, models trained on the purified summaries produce shorter outputs, which is the major reason for lower BLEU scores when the un-purified summaries are used as the references.
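To see why shorter outputs lower BLEU against the longer un-purified references, recall the standard brevity penalty of Papineni et al. (2002): it equals 1 when the candidate is longer than the reference and decays exponentially otherwise. A minimal sketch follows; the lengths used are made up and are not values from Table 7.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """Standard BLEU brevity penalty for corpus-level lengths c and r."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)

# Hypothetical lengths: an output that is 20% shorter than the reference.
bp_short = brevity_penalty(80, 100)
bp_equal = brevity_penalty(100, 100)
```

With these toy lengths the shorter output is scaled by roughly 0.78 regardless of how precise its n-grams are, which is why n-gram precision and BP are best read separately.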

How Purification Affects Performance
First, simply switching to the purified training set leads to considerable improvements in the Relation Generation Precision (RGP). This is because removing the ungrounded facts (e.g. His, Agg, and Game types) alleviates their interference with the model while it learns when and where to copy a correct numerical value from the table. Besides, since the ungrounded facts do not contribute to the gold or system output content plan during the information extraction process, the other extractive metrics, Content Selection (CS) and Content Ordering (CO), stay on par.
One abnormality is the big difference in the Content Selection (CS) and Content Ordering (CO) measures from the ENT model. This is not that surprising after examining the outputs, which appear to collapse into template-like summaries. For example, 97.8% of sentences start with the game points followed by a pattern "XX were the superior shooters", where XX represents a team. Tracing back to the model design, it is explicitly trained to model topic shifts at the token level during generation, whereas such shifts happen more often at the sentence level. As a result, it degenerates into remembering a frequent discourse-level pattern from the training data. We observe a similar pattern in the original outputs by Puduppully et al. (2019b), which is aggravated

A New Benchmark on RW-FG
With more insights about the existing methods, we take a step further to achieve better data fidelity. Wiseman et al. (2017) achieved improvements on the ED with Joint Copy (JC) (Gu et al., 2016) model by introducing a reconstruction loss (Tu et al., 2017) during training. Specifically, the decoder states at each time step are used to predict record values in the table to enable broader input information coverage. However, we take a different point of view: one key mechanism to avoid reference errors is to ensure that the set of numerical values mentioned in a sentence belongs to the correct entity with the correct record field type. While the ED-CC model is trained to achieve such alignments, it should also be able to accurately fill the numbers back into the correct cells of an empty table. This should be done by only accessing the column and row information of the cells without explicitly knowing the original cell values. Further leveraging the planner output of the NCP model, the candidate cells to be filled can be reduced to the content plan cells selected by the planner. With this intuition, we devise a new form of table reconstruction (TR) task incorporated into the NCP model.
Specifically, each content plan record has attribute embeddings for $r_j^e$, $r_j^t$, and $r_j^h$, excluding its value, and we encode them using a 1-layer MLP. We then employ the Luong et al. (2015) attention mechanism at each $\hat{y}_t$ that is a numerical value, with the encoded content plan as the memory bank. The attention weights are then viewed as probabilities of selecting each cell to fill with the number $\hat{y}_t$. The model is additionally trained to minimize the negative log-likelihood of the correct cell.
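The table reconstruction objective can be sketched in plain Python as follows. This is a simplified illustration, not the authors' implementation: it assumes dot-product attention scores (one of the scoring variants in Luong et al., 2015), precomputed value-free cell encodings standing in for the MLP outputs, and a loss averaged over numeric decoding steps.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def table_reconstruction_loss(decoder_states, is_number, gold_cells, cell_encodings):
    """Average negative log-likelihood of picking the correct content-plan cell.

    decoder_states: one vector per output step.
    is_number:      one bool per step, True if the emitted token is numeric.
    gold_cells:     index of the correct cell per step (ignored if not numeric).
    cell_encodings: one vector per content-plan cell, built without its value.
    """
    total, count = 0.0, 0
    for state, numeric, gold in zip(decoder_states, is_number, gold_cells):
        if not numeric:
            continue
        # Attention weights over cells are read as cell-selection probabilities.
        weights = softmax([dot(state, cell) for cell in cell_encodings])
        total += -math.log(weights[gold])
        count += 1
    return total / count if count else 0.0

# Toy example with two cells and two decoding steps, only one of them numeric.
cells = [[1.0, 0.0], [0.0, 1.0]]
states = [[0.5, 0.5], [2.0, 0.0]]
loss = table_reconstruction_loss(states, [False, True], [0, 0], cells)
```

In a full model this term would be added to the generation loss so that the decoder state at each numeric token carries enough information to identify the cell the number came from.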

Experiments
Setup: We assess models on the RW-FG corpus to establish a new benchmark. Following Wiseman et al. (2017), we split all samples into train (70%), validation (15%), and test (15%) sets, and perform early stopping (Yao et al., 2007) using BLEU (Papineni et al., 2002). We adapt the template-based generator by Wiseman et al. (2017) and remove the ungrounded end sentences since they are eliminated in RW-FG.
Results: As shown in Table 8, the template model can ensure high Relation Generation Precision (RGP) but is inflexible, as shown by the other measures. Different from what Puduppully et al. (2019b) report, the NCP model is superior on all measures among the baseline neural models. The ENT model only outperforms the basic ED-CC model but surprisingly yields lower Content Selection (CS) measures. Our NCP+TR model outperforms all baselines except for a slightly lower Content Selection Precision (CSP) compared to the NCP model.
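The 70/15/15 split from the setup could be produced along the following lines. Whether the released split was drawn by random shuffling is not stated, so the shuffle, the seed, and the function name are assumptions made only for this sketch.

```python
import random

def split_samples(samples, seed=0):
    """Shuffle with a fixed seed and split into 70% / 15% / 15%."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.70 * len(shuffled))
    n_valid = int(0.15 * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test

# Stand-in sample identifiers; real samples would be (records, summary) pairs.
train, valid, test = split_samples(range(100))
```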

Discussion
We observe that the ED-CC model produces the least number of candidate records, and correspondingly achieves the lowest Content Selection Recall (CSR) compared to the gold standard content plans. As discussed in section 3.4, the template-like discourse pattern produced by the ENT model noticeably deteriorates its performance. It is completely outperformed by the NCP model and even achieves lower CO-DLD than the ED-CC model. Finally, as supported by the extractive evaluation metrics, employing table reconstruction as an auxiliary task indeed boosts the decoder to produce more accurate factual statements. We discuss in more detail as follows.

Manual Evaluation
To gain more insight into how exactly NCP+TR improves over NCP in terms of factual accuracy, we manually examined the outputs on the 30 samples. We compare the two systems after categorizing the errors into 4 types. As shown in Table 9, the largest improvement comes from reducing repeated statements and wrong fact claims, where the latter involves referring to the wrong entity or making the wrong judgment of the numerical value. The NCP+TR model generally produces more concise outputs with fewer repetitions, consistent with the objective of table reconstruction.

Case study
Table 10 shows a pair of outputs by the two systems. In this example, the NCP+TR model corrects the wrong player name "Jahlil Okafor" to "Joel Embiid" while keeping the statistics intact. It also avoids repeating "Channing Frye" and the semantically incoherent expression about "Kevin Love" and "Kyrie Irving". Nonetheless, the NCP output selects more records to describe the progress of the game. This shows how NCP+TR, trained with more constraints, behaves more accurately but more conservatively.

Errors and Challenges
Having revamped the task with a better focus and re-assessed existing and improved models, we discuss 3 future directions for this task with concrete examples in Table 11:
Content Selection: Since writers are subjective in choosing what to say given the boxscore, it is unrealistic to force a model to mimic all kinds of styles. However, a model still needs to learn from training to select both the salient (e.g. surprisingly high/low statistics for a team/player) and the popular (e.g. the big stars) statistics. One potential direction is to involve multiple human references to help reveal such saliency and make the Content Ordering (CO) and Content Selection (CS) measures more interpretable. This is particularly applicable to the sports domain, since a game can be uniquely identified by the teams and date but mapped to articles from different sources. Besides, multi-reference has been explored for evaluating data-to-text generation systems (Novikova et al., 2017) and for content selection and planning (Gehrmann et al., 2018). It has also been studied in machine translation for evaluation (Dreyer and Marcu, 2012) and training (Zheng et al., 2018).
Content Planning: Content plans have been extracted by linearly rolling out the records, and topic shifts are modeled as sequential changes between adjacent entities. However, this fashion does not reflect the hierarchical discourse structures of a document and thus ensures neither intra- nor inter-sentence coherence. As shown by the errors in (1) in Table 11, the links between entities and their numerical statistics are not strictly monotonic and switching the order results in errors.
On the other hand, autoregressive training for creating such content plans limits the model to capturing frequent sequence patterns rather than allowing diverse arrangements. Moryossef et al. (2019) demonstrate isolating the content planning from the joint end-to-end training and employing multiple valid content plans during testing. Although the content plan extraction heuristics are dataset-dependent, this is worth exploring for data in a closed domain like RW.
Surface Realization: Although the NCP+TR model has achieved nearly 96% Relation Generation Precision (RGP), it is still paramount to keep improving data accuracy, since one single mistake is destructive to the whole document. The challenge lies more with the evaluation metrics. Specifically, all extractive metrics only validate whether an extracted record maps to the true entity and type but disregard the semantics of its context. For example, in (2) in Table 11, even assuming the linear ordering of records, their context still causes inter-sentence incoherence. In particular, both LeBron and Kevin scored double digits and JJ Barea leads the scores rather than Dirk. For another example, in (3), the 20 turnovers record is selected as the Heat's but falsely expressed as the Sixers'. As pointed out by Wiseman et al. (2017), this may require the integration of semantic or reference-based constraints during generation. Number magnitudes should also be incorporated. For example, Nie et al. (2018) have devised an interesting idea to implicitly improve coherence by supplementing the input with pre-computed results from algebraic operations on the table. Moreover, Qin et al. (2018) proposed to automatically align the game summary with the record types in the input table at the phrase level. This can potentially be combined with the operation results to correct incoherence errors and improve the generations.

Table 10: A pair of outputs by the two systems.
NCP+TR: The Cleveland Cavaliers defeated the Philadelphia 76ers, 102-101, at Wells Fargo Center on Monday evening. LeBron James led the way with a 25-point, 14-assist double double that also included 8 rebounds, 2 steals and 1 block. Kevin Love followed with a 20-point, 11-rebound double double that also included 1 assist and 1 block. Channing Frye led the bench with 12 points, 2 rebounds, 2 assists and 2 steals. Kyrie Irving managed 8 points, 7 rebounds, 2 assists and 2 steals. ... Joel Embiid's 22 points led the Sixers, a total he supplemented with 6 rebounds, 2 assists, 4 blocks and 1 steal ...
NCP: The Cleveland Cavaliers defeated the Philadelphia 76ers, 102-101, at Wells Fargo Center on Friday evening. The Cavaliers came out of the gates hot, jumping out to a 34-15 lead after 1 quarter. However, the Sixers (0-5) stormed back in the second to cut the deficit to just 2 points by halftime. However, the light went on for Cleveland at intermission, as they built a 9-point lead by halftime. LeBron James led the way for the Cavaliers with a 25-point, 14-assist double double that also included 8 rebounds, 2 steals and 1 block. Kyrie Irving followed Kevin Love with a 20-point, 11-rebound double double that also included 1 assist and 1 block. Channing Frye furnished 12 points, 2 rebounds, 2 assists and 2 steals ... Channing Frye led the bench with 12 points, 2 rebounds, 2 assists and 2 steals. Jahlil Okafor led the Sixers with 22 points, 6 rebounds, 2 assists, 4 blocks and 1 steal ... Jahlil Okafor managed 14 points, 5 rebounds, 3 blocks and 1 steal.

Table 11: Cases for three major types of system errors
(1) Intra-sentence coherence:
• The Lakers were the superior shooters in this game, going 48 percent from the field and 24 percent from the three point line, while the Jazz went 47 percent from the floor and just 30 percent from beyond the arc.
• The Rockets got off to a quick start in this game, out scoring the Nuggets 21-31 right away in the 1st quarter.
(2) Inter-sentence coherence:
• LeBron James was the lone bright spot for the Cavaliers, as he led the team with 20 points. Kevin Love was the only Cleveland starter in double figures, as he tallied 17 points, 11 rebounds and 3 assists in the loss.
(3) Incorrect claim:
• The Heat were able to force 20 turnovers from the Sixers, which may have been the difference in this game.

Related Works
Various forms of structured data have been used as input for data-to-text generation tasks, such as trees (Belz et al., 2011; Mille et al., 2018), graphs (Konstas and Lapata, 2012), dialog moves (Novikova et al., 2017), knowledge bases (Gardent et al., 2017b; Chisholm et al., 2017), databases (Konstas and Lapata, 2012; Gardent et al., 2017a), and tables (Wiseman et al., 2017; Lebret et al., 2016). The RW corpus we studied is from the sports domain, which has attracted great interest (Chen and Mooney, 2008; Mei et al., 2016; Puduppully et al., 2019b). However, unlike generating one-entity descriptions (Lebret et al., 2016) or having the output strictly bounded by the inputs (Novikova et al., 2017), this corpus poses additional challenges since the targets contain ungrounded contents. To facilitate better usage and evaluation of this task, we hope to provide a refined alternative, similar in purpose to Castro Ferreira et al. (2018).

Conclusion
In this work, we study the core fact-grounding aspect of the data-to-text generation task and contribute a purified, enlarged, and enriched RotoWire-FG corpus with a fairer and more reliable evaluation setup. We re-assess existing models and find that the more focused setting helps the models express more accurate statements and alleviates fact hallucinations. By improving the state-of-the-art model and setting a benchmark on the new task, we reveal fine-grained unsolved challenges, hoping to inspire more research in this direction.