Enhancing Content Planning for Table-to-Text Generation with Data Understanding and Verification

Neural table-to-text models, which select and order salient data and verbalize them fluently via surface realization, have achieved promising progress. Results from previous work indicate that the performance bottleneck of current models lies in the stage of content planning (selecting and ordering salient content from the input): performance drops drastically when an oracle content plan is replaced by a model-inferred one during surface realization. In this paper, we propose to enhance neural content planning by (1) understanding data values with contextual numerical value representations that bring a sense of value comparison into content planning; and (2) verifying the importance and ordering of the selected sequence of records with policy gradient. We evaluated our model on ROTOWIRE and MLB, two datasets for this task, and results show that our model outperforms existing systems with respect to content planning metrics.


Introduction
Table-to-text generation refers to the task of generating text from structured data. Models for this task can be mainly categorized into two types: pipeline-style models, which decompose the generation process into sequential stages, including content planning (Stage 1, selecting and ordering salient content from the input) and surface realization (Stage 2, converting the content plan to a surface string) (Kukich, 1983; McKeown, 1985); and end-to-end models, which entangle the aforementioned stages and generate text directly from structured data through a neural encoder-decoder framework (Wiseman et al., 2017; Nie et al., 2018). As in Fig. 1, this task provides tables with redundant records. Each record contains several elements (e.g., an entity, a type, and a value), and models must generate text reflecting the salient records. Many neural end-to-end models have achieved remarkable progress in generating fluent and natural text on this task (Puduppully et al., 2019b; Gong et al., 2019). However, previous work notes that the content planning stage is the key factor in table-to-text generation (Gkatzia, 2016), yet it is difficult for end-to-end models to explicitly improve their content planning ability. Recently, Puduppully et al. (2019a) proposed Neural Content Planning (NCP), a two-stage model that explicitly selects and orders salient records whilst keeping the end-to-end models' ability to generate fluent text. They show that content planning (referring to both "content selection and planning" in Puduppully et al. (2019a)) indeed correlates with the quality of the final output. Yet, NCP simply maximizes the log-likelihood of pre-extracted content plan sequences given all records, and according to their reported results, the inferred content plans are still far from the oracle. Thus, we focus on bridging the gap between the inferred content plans and the upper bound in Stage 1, and thereby improving the final generation results.
We observe that whether a record is important highly depends on its record value. However, NCP, like other neural generation models, treats numerical values in the table as plain tokens, so the prominent role of values in content planning is not recognized. Take Fig. 1 for example. Compared to the gold text, NCP mistakenly states that "The Memphis Grizzlies defeated the Brooklyn Nets", while the "Nets" clearly score more points than the "Grizzlies" in this match. Also, NCP neglects important players such as "Lin", who performs second best in the team "Nets". We hypothesize that this is because the model lacks an understanding of values in their given context (here, context means the structured table information) when representing the corresponding records. In addition, we find that NCP tends to include redundant information when describing those players. For example, NCP includes a redundant "two assists" when describing "Lopez". A possible reason is that maximum likelihood estimation (MLE) alone is not enough to help verify important records during training.
To address the aforementioned numerical value understanding and important record verification problems, we propose a generation model with Data Understanding and Verification (DUV), improving content planning in the framework of NCP. Specifically, we design contextual numerical value representations obtained through a pre-trained ranking task: in the pre-trained model, we compare pairwise numerical values describing the same type of information and decide which is higher. In the record encoder of the main model, we then replace each value representation with its contextual version from the pre-trained model, so that the constructed record representation is also context-aware. Besides, instead of using simple MLE, we design integrated rewards to verify content planning results. We conducted experiments on ROTOWIRE and MLB, showing that our model outperforms existing systems regarding the content selection and ordering metrics.

Background
This task's input consists of tables S of records. The basics of a record r include entity r.e, type r.c, value r.v and features r.f. Models need to generate text y = (y_1, y_2, ..., y_{|y|}) (|y| is the number of words) to describe important records in the tables. As stated in Sec. 1, this task has two main stages: (i) content planning, and (ii) surface realization. Puduppully et al. (2019a) propose Neural Content Planning (NCP) to explicitly optimize these two stages in deep neural networks, making the generation process more interpretable with an intermediate content plan. Thus, we use it as our base model.
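As an illustration of this record structure, a single box-score record from Fig. 1 could be represented as follows (a minimal sketch; the field values and the "PTS"/"vis" labels are illustrative, not the dataset's actual schema):

```python
from collections import namedtuple

# A record r with the four elements described above:
# entity r.e, type r.c, value r.v, and features r.f.
Record = namedtuple("Record", ["e", "c", "v", "f"])

# e.g. "Jeremy Lin scored 18 points for the visiting team"
r = Record(e="Jeremy Lin", c="PTS", v="18", f="vis")

assert (r.e, r.c, r.v, r.f) == ("Jeremy Lin", "PTS", "18", "vis")
```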
In Stage 1 (content planning), NCP embeds tokens into embedding vectors and encodes each record r with a one-layer MLP for ROTOWIRE:

r = ReLU(W_a [r.e; r.c; r.v; r.f] + b_a)  (1)

Here, r.* represents the corresponding embedding vectors, W_a and b_a are trainable parameters, and [;] denotes vector concatenation. The reason to choose an MLP is that ROTOWIRE's records are game statistics with no sequential relationship between them. For MLB, we follow Puduppully et al. (2019b) and use an LSTM instead because its input includes sequential event data. Next, a content selection gate is applied to each r to control the amount of information flowing from the record. An LSTM-based pointer network (Vinyals et al., 2015) is then applied to sequentially decode a content plan, which is a sequence of important records extracted from the output text, denoted as r* = {r*_1, ..., r*_T} (T is the number of records mentioned in y). Here, we follow Puduppully et al. (2019a) and extract oracle content plans using an information extraction (IE) approach. At each time step, the decoder takes the previously selected record's representation as input and uses the attention weights to select the next important record.
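The one-layer MLP encoder and the content selection gate above can be sketched as follows (a simplified NumPy illustration with random weights; the real model learns the parameters jointly, and NCP's gate additionally attends over all records in the table, which we omit here):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # embedding size per record element (illustrative)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Embedding vectors for one record's entity, type, value, and feature.
e, c, v, f = (rng.normal(size=d) for _ in range(4))

# One-layer MLP encoder: r = ReLU(W_a [e; c; v; f] + b_a)
W_a = rng.normal(size=(d, 4 * d))
b_a = np.zeros(d)
r = relu(W_a @ np.concatenate([e, c, v, f]) + b_a)

# Content selection gate (simplified to an element-wise gate in (0, 1))
# that controls how much of r's information flows onward.
W_g = rng.normal(size=(d, d))
g = sigmoid(W_g @ r)
r_cs = g * r

assert r.shape == (d,) and np.all((g > 0.0) & (g < 1.0))
```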
In Stage 2 (surface realization), a standard encoder-decoder model is applied, taking the output content plan from Stage 1 as input and generating text with an attention mechanism (Luong et al., 2015) and a conditional copy mechanism (Gulcehre et al., 2016). From the results in Puduppully et al. (2019a), the performance bottleneck lies in Stage 1: if gold content plans are fed into Stage 2, final results are much better, but if inferred content plans are fed instead, performance decreases drastically. Therefore, we focus on improving NCP's Stage 1 for better final outputs.

Approach
We propose to improve content planning (Stage 1 of NCP) from two aspects: (i) during record encoding, we design a contextual numerical value representation to improve the understanding of entities' (players' and teams') performance; (ii) a reinforced training strategy with targeted supervision signals is used to complement MLE training of the pointer network and boost the model's content planning ability. Fig. 2 illustrates the overall training procedure. We first pre-train a model with a pairwise ranking loss to learn contextual numerical value representations that capture the relationships between records' numerical values. Secondly, given the pre-trained model and table S, we encode each record with its contextual numerical value representation. In the decoding phase of Stage 1, the pointer network is guided to favor important records for content planning with the help of reinforced supervision signals. Stage 2 remains the same as in the base model. We describe the details in the following parts.

Contextual Numerical Value Representation
Current models overlook two properties of numerical values. (1) Whether a value is salient often depends on how it compares with other values of the same type; for example, the team scoring more points wins the game. (2) The same numerical value describing different types of information should not be interpreted in the same way. For example, "5" assists may indicate good performance, while "5" points may suggest disappointing performance. Hence, it is important to model a numerical value in the context of other numerical values describing the same type of information in order to understand what lies behind those values.
Here, we propose to learn contextual numerical value representations for this task.
We extract numerical values that describe the same type of information from the same table to form training samples (e.g., players' points in the Nets) for a pre-training task. Our main idea is to use a Transformer encoder (Vaswani et al., 2017) to compare each numerical value with the others in each training sample. We first use it to fuse information across the numerical values in the same sample and obtain their contextual numerical value representations. Next, we optimize a pairwise ranking loss over these contextual representations such that a larger numerical value receives a higher ranking score. Taking all raw numerical value embeddings r_i.v of a training sample as input H^0 = [r_1.v, ..., r_n.v], we construct the contextual numerical value embeddings R̃ = [r̃_1, ..., r̃_n] via a multi-layer Transformer encoder:

H̄^l = LN(H^{l-1} + MHSelfAtt(H^{l-1}))  (2)
H^l = LN(H̄^l + FFN(H̄^l))  (3)
R̃ = H^L  (4)

where n is the number of numerical values in the sample, L is the number of layers, LN is layer normalization, MHSelfAtt is the multi-head self-attention function, and FFN is a position-wise feed-forward network.
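One such encoder layer over the value embeddings of a sample can be sketched as follows (a single-head NumPy illustration; the sizes and random weights are placeholders for the learned multi-head, multi-layer model):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 8  # n numerical values of the same type in one sample (illustrative)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def softmax(x):
    z = np.exp(x - x.max(-1, keepdims=True))
    return z / z.sum(-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Scaled dot-product attention over all values in the sample.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return A @ V

# Raw value embeddings H^0 for the n numerical values in one sample.
X = rng.normal(size=(n, d))
Wq, Wk, Wv, W1, W2 = (rng.normal(size=(d, d)) * 0.1 for _ in range(5))

# One encoder layer with residual connections and layer normalization:
H = layer_norm(X + self_attention(X, Wq, Wk, Wv))   # self-attention sublayer
H = layer_norm(H + np.maximum(0.0, H @ W1) @ W2)    # position-wise FFN sublayer

assert H.shape == (n, d)  # contextual representations, one per value
```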
Given a pair of contextual numerical value representations r̃_i and r̃_j, we use a fully connected layer f(r̃_i) = sigmoid(W_p r̃_i + b_p) to calculate a ranking score for each numerical value in the current input sample. If r_i.v ≥ r_j.v, we expect f(r̃_i) to be higher than f(r̃_j). To train the contextual numerical value representations, we use the hinge loss

L_rank = Σ_{i≠j} max(0, ξ − T(r_i.v ≥ r_j.v) · (f(r̃_i) − f(r̃_j)))  (5)

where ξ is the margin and T(·) gives +1 if · is true and −1 otherwise.
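The pairwise ranking objective can be sketched as follows (a plain-Python illustration; `hinge_rank_loss` is a hypothetical helper, and the scores stand in for outputs of f(r̃)):

```python
def hinge_rank_loss(scores, values, margin=0.1):
    """Pairwise hinge loss: a larger raw value should get a higher score.
    T(.) is +1 if values[i] >= values[j], -1 otherwise."""
    loss, n = 0.0, len(values)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            t = 1.0 if values[i] >= values[j] else -1.0
            loss += max(0.0, margin - t * (scores[i] - scores[j]))
    return loss

values = [25, 18, 5]      # e.g. three players' points in one sample
good   = [0.9, 0.6, 0.1]  # scores ordered consistently with the values
bad    = [0.1, 0.6, 0.9]  # scores ordered the wrong way round

# Correctly ordered scores incur a lower loss than inverted ones.
assert hinge_rank_loss(good, values) < hinge_rank_loss(bad, values)
```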
We construct training samples of the pre-training task using all training tables; note that numerical values from different types of information form different samples. Once the pre-trained model has converged, we use it in the record encoder in Eq. 1 by replacing each token embedding r_i.v with its contextual representation r̃_i obtained via Eq. 2 to Eq. 4.

Content Planning Verification
The original NCP uses the pointer network to explicitly infer a content plan by maximizing the likelihood of gold content plans. As noticed in other generation tasks (Sordoni et al., 2015; Li et al., 2016a; Dai et al., 2017), generation models trained with the MLE objective tend to produce universal output sequences observed in the training data, and it is desirable to integrate developer-defined rewards that better mimic the true goal of an ideal output sequence (Li et al., 2016b), which in our task is the content plan sequence. In order to explicitly reflect the quality of content plans, we explore rewards that measure the following five criteria and optimize the model according to them via policy gradient (Sutton and Barto, 1998).
• Entity Importance (EI) evaluates if a predicted record r t contains an important entity by comparing whether the entity is mentioned in the gold content plan {r * i }. R(·) function gives +1 reward when · is true and -1 otherwise. EI(r t ) = R(r t .e ∈ {r * i .e}).
• Entity Recall (ER) measures how many important entities are covered by the decoded content plan r = {r t }. 1(·) is the indicator function which is 1 when · is true, otherwise 0.
• Record Importance (RI) and Record Recall (RR) are similar to EI and ER respectively, but focus on each individual record instead of the entity only.
• Record Ordering (RO) calculates the normalized Damerau-Levenshtein distance (Brill and Moore, 2000) between the predicted content plan r and the reference r* in order to measure how well the model organizes the chosen records.
The above rewards measure the content plan at different granularities. EI and ER focus on whether the selected entity (player/team) is an important one; it is also crucial to decide which of the entity's records need to be mentioned, so we further include RI and RR. Afterwards, we sample a record sequence, combine all rewards, and use policy gradient to guide the optimization of content selection given the input table S:

R_tok(r_t) = γ_1 EI(r_t) + γ_2 RI(r_t)  (11)
R_seq(r) = γ_3 ER(r) + γ_4 RR(r) + γ_5 RO(r)  (12)

Given a batch of input tables {S}_G and gold content plans {r*}_G, we first train the pointer network by optimizing the MLE objective L_gen = −(1/G) Σ_{g=1}^{G} (1/T_g) Σ_{t=1}^{T_g} log P(r*_{t,g} | r*_{<t,g}, S_g). Then, we further fine-tune it with both the MLE loss and policy gradient: L = γ_6 L_rl + (1 − γ_6) L_gen. Note that T_g denotes the length of the content plan of the g-th table. γ_1-γ_6 and β are hyper-parameters.
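The entity-level rewards and the ordering reward can be sketched as follows (a plain-Python illustration on toy (entity, type) records; RO here uses the optimal-string-alignment variant of Damerau-Levenshtein distance, normalized by the longer sequence length and subtracted from 1 so that higher means better ordering — the sign convention is our assumption):

```python
def R(cond):
    """+1 / -1 reward, as in the EI definition above."""
    return 1.0 if cond else -1.0

def EI(r_t, gold):
    """Does the predicted record's entity appear in the gold plan?"""
    return R(r_t[0] in {g[0] for g in gold})

def ER(pred, gold):
    """Fraction of gold entities covered by the predicted plan."""
    gold_e = {g[0] for g in gold}
    return len({p[0] for p in pred} & gold_e) / len(gold_e)

def dl_distance(a, b):
    """Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

def RO(pred, gold):
    """1 - normalized distance: higher = better record ordering."""
    return 1.0 - dl_distance(pred, gold) / max(len(pred), len(gold))

# Toy content plans as (entity, type) records.
gold = [("Nets", "PTS"), ("Lin", "PTS"), ("Lopez", "PTS")]
pred = [("Nets", "PTS"), ("Lopez", "PTS"), ("Lin", "PTS")]

assert EI(pred[0], gold) == 1.0 and ER(pred, gold) == 1.0
assert 0.0 < RO(pred, gold) < 1.0  # one transposition away from gold
```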

Setup
Dataset and Evaluation Metrics We conducted experiments on both the ROTOWIRE 1 and MLB (Puduppully et al., 2019b) datasets. The former provides pairs of NBA game statistics and summaries. The latter provides summaries with heterogeneous input, consisting of MLB game statistics and event data (including event type, actors, etc.) in chronological order. For ROTOWIRE, we follow the official training, development and test splits of 3398/727/728 instances. For MLB, as the contents are not released, we retrieved a split of 22820/1739/1744 instances via the official scripts 2.
For evaluation, we use BLEU (Papineni et al., 2002) and three extractive metrics (Wiseman et al., 2017), which evaluate the generated results from the following aspects: (1) Relation Generation (RG) measures the precision (P%) and number (#) of records extracted from the generated text that are supported by the input table; (2) Content Selection (CS) measures the precision (P%), recall (R%) and F1 (F1%) of the extracted records against those extracted from the gold text; (3) Content Ordering (CO) measures the normalized Damerau-Levenshtein distance between the record sequences extracted from the generated and gold text.

Results
Comparing Methods In this section, we compare:
• Template: We follow Wiseman et al. (2017) and Puduppully et al. (2019b) in constructing template-based generators for ROTOWIRE and MLB respectively. The details, as well as the Conditional Copy (CC) model, can be found in those papers.
• NCP+CC (NCP): our base model. Here, we provide both the results reported in the original paper and those reproduced by us, denoted as NCP(R). We also try a variant of NCP that uses separate sets of embeddings in the encoders of the two stages, denoted as S-NCP. We observe that S-NCP is comparable with reproduced NCP, with the ability to explicitly improve Stage 1 without affecting Stage 2. Thus, we use it to further verify our proposed model.
• Entity Modeling (ENT) (Puduppully et al., 2019b) and Hierarchical Encoder on Three Dimensions (HETD) (Gong et al., 2019) are two state-of-the-art models on ROTOWIRE and/or MLB. OpAtt (Nie et al., 2018) introduces pre-executed operations for text generation.
• Data Understanding with content plan Verification (DUV): our proposed full model. We also include two variants for ablations: S-NCP + Verification (S-N+V) to study our model without data understanding, and Data Understanding (DU) to study without content plan verification.
Automatic Evaluation For ROTOWIRE, as shown in Table 1, the template system achieves high RG P% (high fidelity) due to its rigid rules. It also achieves high CS R% since it includes a vast amount of information (high RG #), some of which is redundant (low CS P%). Compared with it, most neural models perform significantly better at filtering redundant records (CS P%) while still covering many important records, leading to better CS F1%. The higher CO also shows that neural models can better organize data records conditioned on the data. Among all neural models, DUV exceeds the others in terms of content selection (CS F1%) and content ordering (CO) on the test set. Also, comparing DUV with its base model (S-NCP), our model improves most on CS P%. In terms of RG, our model also performs better than the base model, but still has a gap to ENT and HETD. This is mainly affected by surface realization (Stage 2), which is beyond the scope of this paper.
For MLB, we find a similar pattern. The differences are: (1) improvements on CS and CO are less significant than on ROTOWIRE. Since MLB includes additional event data that ROTOWIRE does not have, we separate out the statistical data in Table 4 for a fair comparison. We find that the base model (S-NCP) already achieves 73.43% CS F1% in Stage 1 on MLB's statistical data (Table 4), versus 44.37% on ROTOWIRE (Table 3), leaving much less room for improvement. (2) NCP-style models achieve lower BLEU than ENT on MLB; the latter (Brevity Penalty, BP 0.736) generates longer text compared with DUV (BP 0.623). This is mainly due to surface realization (Stage 2), which we leave for future work. We also evaluate the ablations (S-N+V and DU); the results show that both the data understanding and verification modules contribute to the overall improvement. Due to the page limit, we include validation performance in the Appendix.

Human Evaluation Each example below is evaluated by 3 different annotators from a commercial annotation company, who are proficient in English, and we report the average of the three annotators' results in the following settings. First, we sample 30 examples from the test set and ask annotators to determine how many facts in the summary are supported by (#Sup) and how many contradict (#Cont) the table. On ROTOWIRE, our model describes the table more concisely (closest #Sup to the gold text) while producing significantly fewer contradicting facts than NCP, thanks to the significant improvement in Stage 1. We observe that the gold text contains incorrect facts (e.g., a wrong field-goal percentage), while the #Cont of TEMP is due to annotation error. The gap between ENT and DUV on #Cont shows the potential of Stage 2, which is beyond the scope of this paper. Second, we arrange the models' results for each example into 10/15 pairs (ROTOWIRE/MLB) and ask annotators to determine which one in each pair performs better in terms of grammaticality, coherence and conciseness.
The reported result is the difference between the percentage of times a system is judged better and the percentage of times it is judged worse. On ROTOWIRE, DUV generates the most coherent text among neural models, but is less satisfying on grammaticality and conciseness compared with ENT. This is mainly affected by surface realization (Stage 2); a possible remedy is to use large-scale pre-trained language models such as GPT-2 (Radford et al., 2019). On MLB, DUV achieves comparable performance with NCP across the 5 metrics due to the shared Stage 2.

Table 2: Human evaluation results. Marked models perform significantly differently from DUV (p < 0.05), using a one-way ANOVA with post-hoc Tukey HSD tests. We omit CC on ROTOWIRE because NCP has been shown to be better (Puduppully et al., 2019a).

Analysis
Visualization Fig. 3 visualizes values' token embeddings (in red) and our contextual numerical value representations (in blue). Token embeddings are close to each other, while the contextual ones are more discriminative and naturally ordered from low to high along the "blue arc". We hypothesize that this phenomenon contributes to the improvement in content selection.

Content Planning In Table 3, DUV improves over the base model on CS and CO. Considering both CS P% and R%, DUV generates more concise but informative content plans with little sacrifice of recall. Next, by subtracting each reward from DUV, we observe that all rewards contribute to DUV's improvement on content selection and ordering.

ROTOWIRE v.s. MLB Our model's improvements on CS and CO are significant on ROTOWIRE, but less so on MLB. Different from ROTOWIRE, MLB additionally provides sequential event data, so the two sources of input can be regarded as heterogeneous. On the test set, the gold text mentions 12.69 statistical records on average versus 4.16 event records (extracted by the IE model). In Table 4, we report CS and CO for the two types of data separately. DU and verification both improve over the base model, with verification contributing more overall. They consistently improve CS F1% and CO on statistical data, but the high CS of the base model indicates little room for improvement there. Meanwhile, event data is the bottleneck, and the drop on it also accounts for the less significant overall CS and CO improvement. This reveals content planning on heterogeneous input on MLB as future work.

Case Study
Compared with NCP and the gold text in Fig. 1, DUV (Fig. 4) has three nice properties: (1) it accurately states that the "Nets", with more points, defeated the "Grizzlies", where NCP fails; this is due to our model's ability to compare values. (2) Our model filters unimportant records (CS P%) while covering the important ones (CS R%) better than both NCP and ENT; note that our model covers all important players and their records in this case, while the baseline only mentions one not-so-impressive player's records. (3) By comparing the content planning (Stage 1) results and the actual records mentioned in our model's text (Stage 2), the main challenge indeed lies in content planning, since surface realization can faithfully deliver most of the information (93.10%) in the same order.

Figure 4: Generation examples based on the tables in Fig. 1. Important/unimportant entities and records are in red/blue. Text that accurately/incorrectly reflects the statistics in the table is in bold/italic. Due to the page limit, we include a generation example on MLB in the Appendix.

Related Work
In the past few years, table-to-text generation has attracted much attention. To improve text fidelity, Li and Wan (2018) propose to generate templates and then fill the slots, while Nie et al. (2018) use pre-executed operations; our work, by contrast, focuses on improving content planning. Puduppully et al. (2019b) propose to specifically model entities when decoding text; different from them, we model numerical values during encoding. Iso et al. (2019) incorporate writers' information to generate text step by step. Our work could also use such information in surface realization (Stage 2), but for a fair comparison of all methods we do not include this model here. Gong et al. (2019) utilize hierarchical encoders with dual attention to consider both the table structure and history information. In terms of building numerical value representations, Spithourakis and Riedel (2018) explore number prediction for language models, while Naik et al. (2019) explore numerical embeddings that capture the numeration and magnitude properties of numbers. In our task, generation models rely heavily on the copy mechanism to cover numerical values in text and achieve good results; thus, understanding numerical values in order to select records becomes important, and we propose to understand them through their context.

Conclusion
In order to enhance neural content planning for table-to-text generation, we proposed (1) contextual numerical value representations that help the model understand data values, and (2) effective rewards that verify the model's inferred important records during training. Experimental results show that our model outperforms competitive baselines in terms of content planning. In the future, we would like to jointly explore enhancements to surface realization to generate better text.