Table-to-Text Generation with Effective Hierarchical Encoder on Three Dimensions (Row, Column and Time)

Although Seq2Seq models for table-to-text generation have achieved remarkable progress, modeling a table's representation in one dimension is inadequate. This is because (1) a table consists of multiple rows and columns, so encoding it should not rely only on a one-dimensional sequence or set of records, and (2) most such tables are time-series data (e.g. NBA game data, stock market data), so the description of the current table may be affected by its historical data. To address these problems, we not only model each table cell considering the other records in the same row, but also enrich the table's representation by modeling each cell in the context of other cells in the same column and of its historical (time dimension) data. In addition, we develop a table cell fusion gate that combines the representations from the row, column and time dimensions into one dense vector according to the saliency of each dimension's representation. We evaluate our methods on ROTOWIRE, a benchmark dataset of NBA basketball games. Both automatic and human evaluation results demonstrate the effectiveness of our model, with an improvement of 2.66 BLEU over a strong baseline and results that surpass the state-of-the-art model.


Introduction
Table-to-text generation is an important and challenging task in natural language processing, which aims to produce a summary of a numerical table (Reiter and Dale, 2000; Gkatzia, 2016). The related methods can be empirically divided into two categories: pipeline models and end-to-end models. The former consist of content selection, document planning and realisation, and were mainly built for early industrial applications such as weather forecasting and medical monitoring. The latter generate text directly from the table through a standard neural encoder-decoder framework to avoid error propagation, and have achieved remarkable progress. In this paper, we focus on improving the performance of neural methods for table-to-text generation.

Figure 1: Generated example on ROTOWIRE using Conditional Copy (CC) as the baseline (Wiseman et al., 2017). Text that accurately reflects records in the table is in red, and text that contradicts the records is in blue.

Recently, ROTOWIRE, which provides tables of NBA players' and teams' statistics with a descriptive summary, has drawn increasing attention from the academic community. Figure 1 shows an example of part of a game's statistics and its corresponding computer-generated summary. We can see that the table has a formal structure including row headers, column headers and table cells. "Al Jefferson" is a row header representing a player, "PTS" is a column header indicating that the column contains players' scores, and "18" is the value of the table cell; that is, Al Jefferson scored 18 points. Several related models have been proposed. They typically encode the table's records separately or as one long sequence and generate a long descriptive summary with a standard Seq2Seq decoder with some modifications. Wiseman et al. (2017) explored two types of copy mechanism and found that the conditional copy model (Gulcehre et al., 2016) performs better. Puduppully et al. (2019) enhanced content selection by explicitly selecting and planning relevant records. Li and Wan (2018) improved the precision of describing data records in the generated texts by first generating a template and then filling in the slots via a copy mechanism. Nie et al. (2018) utilized results from pre-executed operations to improve the fidelity of generated texts. However, we claim that encoding tables as sets of records or as one long sequence is not suitable, because (1) the table consists of multiple players and different types of information, as shown in Figure 1.
Earlier encoding approaches treated the table only as a set of records or a one-dimensional sequence, losing the information of the other (column) dimension.
(2) the table cells contain time-series data that change over time. That is to say, historical data can sometimes help the model select content. Moreover, when a human writes a basketball report, he will not only focus on the players' outstanding performance in the current match, but also summarize their performance in recent matches. Take Figure 1 again: not only do the gold texts mention Al Jefferson's great performance in this match, they also state that "It was the second time in the last three games he's posted a double-double". The gold texts summarize John Wall's "double-double" performance in a similar way. Summarizing a player's performance in recent matches requires modeling a table cell with respect to its historical data (time dimension), which is absent in the baseline model. Although the baseline Conditional Copy (CC) model tries to produce such a summary for Gerald Henderson, it clearly produces a wrong statement, since he did not record a "double-double" in this match.
To address the aforementioned problems, we present a hierarchical encoder that simultaneously models row, column and time dimension information. In detail, our model is divided into three layers. The first layer learns the representation of each table cell: we employ three self-attention models to obtain three representations of the cell in its row, column and time dimensions. In the second layer, we design a record fusion gate to identify the most important of these three representations and combine them into a dense vector. In the third layer, we use mean pooling to merge the previously obtained table cell representations in the same row into a representation of the table row. Then, we use self-attention with a content selection gate (Puduppully et al., 2019) to filter out unimportant rows' information. To the best of our knowledge, this is the first work on neural table-to-text generation that jointly models row, column and time dimension information.

Notations
The input to the model is a set of tables S = {s_1, s_2, s_3}, where s_1, s_2 and s_3 contain records about the home team's players' performance, the visiting team's players' performance and the teams' overall performance, respectively. We regard each cell in a table as a record. Each record r consists of four types of information: a value r.v (e.g. 18), an entity r.e (e.g. Al Jefferson), a type r.c (e.g. POINTS) and a feature r.f (e.g. visiting) indicating whether a player or team competes at home or not. Each player or team takes one row in the table, and each column contains one type of record, such as points or assists. Tables also contain the date on which the match took place, and we let k denote the date of a record. We also create timelines for records; the details of timeline construction are described in Section 2.2. For simplicity, we omit the table id l and record date k in the following sections and let r_{i,j} denote the record in the i-th row and j-th column of the table. We assume the records come from the same table and that k is the date of the mentioned record. Given this information, the model is expected to generate text y = (y_1, ..., y_t, ..., y_T) describing these tables, where T denotes the length of the text.

Record Timeline Construction
In this paper, we construct timelines tl = {tl_{e,c}}, e = 1..E, c = 1..C, for records, where E denotes the number of distinct record entities and C denotes the number of record types. For each timeline tl_{e,c}, we first extract the records with entity e and type c from the dataset. Then we sort them into a sequence by record date, from old to new; this sequence is the timeline tl_{e,c}. For example, in Figure 2, the "Timeline" part in the lower-left corner represents the timeline for entity Al Jefferson and type PTS (points).
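The grouping-and-sorting step above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code; the dictionary field names (`entity`, `type`, `date`, `value`) are assumptions for the example.

```python
from collections import defaultdict

def build_timelines(records):
    """Group records by (entity, type) and sort each group by date, old -> new."""
    timelines = defaultdict(list)
    for rec in records:
        timelines[(rec["entity"], rec["type"])].append(rec)
    for key in timelines:
        timelines[key].sort(key=lambda rec: rec["date"])  # oldest first
    return dict(timelines)

# Toy records for one entity/type pair (values are made up).
records = [
    {"entity": "Al Jefferson", "type": "PTS", "date": "2014-11-05", "value": 18},
    {"entity": "Al Jefferson", "type": "PTS", "date": "2014-11-01", "value": 24},
    {"entity": "Al Jefferson", "type": "PTS", "date": "2014-11-03", "value": 12},
]
tl = build_timelines(records)
```

Each resulting list is one timeline tl_{e,c}; the time dimension encoder later slices a history window from the end of such a list.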

Baseline Model
We use a Seq2Seq model with attention (Luong et al., 2015) and conditional copy (Gulcehre et al., 2016) as the base model. During training, given tables S and their corresponding reference texts y, the model maximizes the conditional probability P(y|S) = ∏_{t=1}^{T} P(y_t | y_{<t}, S), where t is the decoder timestep. First, for each record in the i-th row and j-th column of the table, we utilize a 1-layer MLP to encode the embeddings of the record's four types of information into a dense vector: r_{i,j} = ReLU(W_a [r_{i,j}.e; r_{i,j}.c; r_{i,j}.v; r_{i,j}.f] + b_a), where W_a and b_a are trainable parameters. The word embeddings for each type of information are trainable and randomly initialized before training, following Wiseman et al. (2017).
[; ] denotes vector concatenation. Then, we use an LSTM decoder with attention and conditional copy to model the conditional probability P(y_t | y_{<t}, S). The base model first uses an attention mechanism (Luong et al., 2015) to find relevant records in the input tables and represents them as a context vector. Note that the base model does not utilize the structure of the three tables: the attention weight α_{t,i',j'} is normalized across all records in all tables. It then combines the context vector with the decoder's hidden state d_t to form a new attentional hidden state d̃_t, which is used to generate words from the vocabulary: P_gen(y_t | y_{<t}, S) = softmax(W_d d̃_t + b_d). The base model also adopts the conditional copy mechanism, which introduces a variable z_t to decide whether to copy from the tables or generate from the vocabulary. Conditioned on z_t, the probability of generating the t-th word y_t, given the tables S and the previously generated words y_{<t}, decomposes as follows.
P(y_t, z_t | y_{<t}, S) =
  P(z_t = 1 | y_{<t}, S) · Σ_{y_t ← r_{i',j'}} α_{t,i',j'}   if z_t = 1
  P(z_t = 0 | y_{<t}, S) · P_gen(y_t | y_{<t}, S)            if z_t = 0   (1)

Built on top of this baseline, our model can first find an important row and then attend to important records when generating text. We describe the model's details in the following parts.
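The decomposition in Equation 1 mixes two normalized distributions, so the result is itself a proper distribution over "generate word w" and "copy record r" events. A minimal numeric sketch (all values are toy numbers, not model outputs):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Toy decoder logits over a 4-word vocabulary and attention scores over
# 3 table records; p_copy plays the role of P(z_t = 1 | y_<t, S).
vocab_logits = [1.0, 0.5, -0.2, 2.0]
attn_scores = [0.3, 1.2, -0.5]
p_copy = 0.6

gen_branch = [(1 - p_copy) * p for p in softmax(vocab_logits)]  # z_t = 0
copy_branch = [p_copy * a for a in softmax(attn_scores)]        # z_t = 1

# Marginalizing over z_t: the two branches together sum to 1.
total = sum(gen_branch) + sum(copy_branch)
```

Because each branch is scaled by its switch probability, no renormalization is needed after mixing.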

Row Dimension Encoder
Based on our observation, when a player's points are mentioned in the text, related records such as "field goals made" (FGM) and "field goals attempted" (FGA) are often also included. Taking the gold text in Figure 1 as an example, when Al Jefferson's 18 points are mentioned, his FGM of 9 and FGA of 19 are also mentioned. Thus, when modeling a record, the other records in the same row can be useful. Since the records in a row are not sequential, we use a self-attention network similar to Liu and Lapata (2018) to model each record in the context of the other records in the same row. Let r^{row}_{i,j} be the row dimension representation of the record in the i-th row and j-th column. We obtain the context vector in the row dimension, c^{row}_{i,j}, by attending to the other records in the same row; note that the attention weights are normalized across records in the same row. Then, we combine the record's representation with c^{row}_{i,j} and obtain the row dimension record representation r^{row}_{i,j} = tanh(W_f [r_{i,j}; c^{row}_{i,j}]), where W_f is a trainable parameter.
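The row-dimension step can be sketched in pure Python. The bilinear score function and all sizes below are illustrative assumptions (the paper does not fix a particular score function here), and the weights are random rather than trained:

```python
import math, random

rng = random.Random(0)
d, n_records = 4, 5  # toy sizes: 5 record vectors of dimension 4 in one row

def rand_mat(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(mrow, v)) for mrow in M]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

row = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n_records)]
W_s = rand_mat(d, d)      # assumed bilinear score: score(r_j, r_k) = r_j^T W_s r_k
W_f = rand_mat(d, 2 * d)  # fusion weight, as in r^row = tanh(W_f [r; c^row])

def row_dimension_rep(row, j):
    others = [row[k] for k in range(len(row)) if k != j]
    scores = [sum(a * b for a, b in zip(matvec(W_s, o), row[j])) for o in others]
    alpha = softmax(scores)  # normalized over the other records in the row
    context = [sum(a * o[t] for a, o in zip(alpha, others)) for t in range(d)]
    return [math.tanh(x) for x in matvec(W_f, row[j] + context)]  # + concatenates

r_row = row_dimension_rep(row, 2)
```

The same attend-then-fuse pattern is reused for the column and time dimensions, only with a different normalization axis.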

Column Dimension Encoder
Each input table consists of multiple rows and columns. Each column covers one type of information, such as points. Only a few rows may have high points or stand out on some other type of information, and thus become important. For example, in the "Column Dimension" part of Figure 2, "Al Jefferson" is more important than "Gary Neal" because the former has more impressive points. Therefore, when encoding a record, it is helpful to compare it with the other records in the same column, in order to understand the level of performance the record reflects among the player's teammates (rows). We employ self-attention in the column dimension, similar to that in Section 3.1.1, to compare records. Let r^{col}_{i,j} be the column representation of the record in the i-th row and j-th column. We obtain the context vector in the column dimension, c^{col}_{i,j}, analogously; note that the attention weight α^{col}_{j,i,i'} is normalized across records from different rows i' of the same column j. The column dimension representation r^{col}_{i,j} is then obtained in the same way as in the row dimension.

Time Dimension Encoder
As mentioned in Section 1, some expressions in the text require information about a player's historical (time dimension) performance, so the history of record r_{i,j} is important. Recall that we have constructed a timeline for each record entity and type, as described in Section 2.2. Given these timelines, we collect the records in the timeline with the same entity and type whose date precedes the date k of record r_{i,j} as its history. Since this history can be too long for some records, we set a history window and keep only the most recent history records within it, denoted hist(r_{i,j}). We model this information in the time dimension via self-attention. However, unlike the unordered rows and columns, the history is sequential. Therefore, we introduce a trainable position embedding emb_{pos}(k') and add it to the record's representation to obtain a new representation rp_{k'}, which represents a record with the same entity and type as r_{i,j} but with date k' before k within the history window. Let r^{time}_{i,j} denote the history representation of the record in the i-th row and j-th column. The history dimension context vector is obtained by attending to the history records in the window; note that we use a 1-layer MLP as the score function here, and α^{time}_{k,k'} is normalized within the history window. We obtain the time dimension representation r^{time}_{i,j} in the same way as in the row dimension.
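A sketch of the windowing, position-embedding and MLP-scored attention steps, under assumed toy sizes and random (untrained) parameters; the exact MLP shape is an assumption consistent with "1-layer MLP as score function":

```python
import math, random

rng = random.Random(1)
d, window = 4, 3  # toy record dimension and history window size

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# A timeline of 6 records (same entity and type), sorted old -> new.
timeline = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(6)]
pos_emb = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(window)]
W = [[rng.uniform(-0.5, 0.5) for _ in range(2 * d)] for _ in range(d)]
v = [rng.uniform(-0.5, 0.5) for _ in range(d)]  # output layer of the score MLP

def time_context(timeline, k):
    hist = timeline[max(0, k - window):k]          # most recent records before k
    hist = [[h_i + p_i for h_i, p_i in zip(h, p)]  # inject order via positions
            for h, p in zip(hist, pos_emb)]
    r_k = timeline[k]
    scores = []
    for h in hist:  # 1-layer MLP score over the concatenation [r_k; rp_k']
        hidden = [math.tanh(sum(w_i * x for w_i, x in zip(w_row, r_k + h)))
                  for w_row in W]
        scores.append(sum(v_i * h_i for v_i, h_i in zip(v, hidden)))
    alpha = softmax(scores)  # normalized within the history window only
    return [sum(a * h[t] for a, h in zip(alpha, hist)) for t in range(d)]

c_time = time_context(timeline, 5)
```

Note that normalization happens only over the window, mirroring how α^{time}_{k,k'} is defined.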

Layer 2: Record Fusion Gate
After obtaining the record representations in the three dimensions, it is important to figure out which representation plays the most important role in reflecting the record's information. If a record stands out from the other rows' records in the same column, the column dimension representation may receive a higher weight in forming the overall record representation. If a record differs significantly from those of previous matches, the time dimension representation may receive a higher weight. Also, some types of information frequently appear together in the text, which can be reflected by the row dimension representation. Therefore, we propose a record fusion gate to adaptively combine the three dimension representations. First, we concatenate r^{row}_{i,j}, r^{col}_{i,j} and r^{time}_{i,j} and apply a 1-layer MLP to obtain a general representation r^{gen}_{i,j}, which we consider a baseline representation of the record's information. Then, we compare each dimension representation with this baseline to obtain its weight in the final record representation, using a 1-layer MLP as the score function. Equation 6 shows the calculation of the column dimension representation's weight; the weights of the row and time dimension representations are obtained in the same way.
In the end, the fused record representation r̃_{i,j} is the weighted sum of the three dimension representations.
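The fusion gate can be sketched as follows. The shared score MLP across the three dimensions and all sizes are assumptions for illustration; parameters are random rather than trained:

```python
import math, random

rng = random.Random(2)
d = 4  # toy representation size

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matvec(M, x):
    return [sum(m * v_ for m, v_ in zip(mrow, x)) for mrow in M]

r_row = [rng.uniform(-1, 1) for _ in range(d)]
r_col = [rng.uniform(-1, 1) for _ in range(d)]
r_time = [rng.uniform(-1, 1) for _ in range(d)]

W_g = [[rng.uniform(-0.5, 0.5) for _ in range(3 * d)] for _ in range(d)]
W_c = [[rng.uniform(-0.5, 0.5) for _ in range(2 * d)] for _ in range(d)]
v = [rng.uniform(-0.5, 0.5) for _ in range(d)]

# Baseline (general) representation from the concatenated dimensions.
r_gen = [max(0.0, x) for x in matvec(W_g, r_row + r_col + r_time)]

def score(r_dim):  # 1-layer MLP comparing one dimension with the baseline
    hidden = [math.tanh(h) for h in matvec(W_c, r_dim + r_gen)]
    return sum(a * b for a, b in zip(v, hidden))

w = softmax([score(r_row), score(r_col), score(r_time)])
fused = [w[0] * a + w[1] * b + w[2] * c
         for a, b, c in zip(r_row, r_col, r_time)]
```

The softmax guarantees the three dimension weights sum to 1, so the fused vector is a convex combination of the three representations.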

Layer 3: Row-level Encoder
For each row, we combine its records via mean pooling (Equation 8) to obtain a general representation of the row, which may reflect the overall performance of the row (player or team): row_i = (Σ_{j=1}^{C} r̃_{i,j}) / C, where C denotes the number of columns.
Then, we apply the content selection gate g_i proposed by Puduppully et al. (2019) to the row representations row_i and obtain a new representation row̃_i = g_i ⊙ row_i, choosing the more important information based on each row's context.
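The row-level layer can be sketched as mean pooling followed by an element-wise gate. The sigmoid-of-a-linear-map gate below is a simplification: the actual gate of Puduppully et al. (2019) also attends over the other rows' context, which is omitted here:

```python
import math, random

rng = random.Random(3)
C, d = 4, 3  # C fused record vectors of size d in one row (toy sizes)

records = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(C)]

# Mean pooling over the row's fused record representations (Equation 8).
row_rep = [sum(rec[t] for rec in records) / C for t in range(d)]

# Simplified content selection gate: element-wise sigmoid of a linear map.
W = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d)]
g = [1.0 / (1.0 + math.exp(-sum(w_i * x for w_i, x in zip(w_row, row_rep))))
     for w_row in W]
row_gated = [g_i * x for g_i, x in zip(g, row_rep)]  # row̃_i = g_i ⊙ row_i
```

Each gate value lies in (0, 1), so the gate can only attenuate, never amplify, a row's features.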

Decoder with Dual Attention
The record encoders with the record fusion gate provide record-level representations, and the row-level encoder provides row-level representations. Inspired by Cohan et al. (2018), we modify the decoder of the base model to first choose an important row and then attend to its records when generating each word. Following the notation in Section 2.3, β_{t,i} ∝ exp(score(d_t, row_i)) gives the attention weight for each row; note that β_{t,i} is normalized across the row-level representations of all three tables. Then, γ_{t,i,j} ∝ exp(score(d_t, r̃_{i,j})) gives the attention weight for records; note that γ_{t,i,j} is normalized among the records in the same row. We use the row-level attention β_{t,i} as guidance for choosing a row based on its general representation, and use it to re-weight the record-level attention γ_{t,i,j}, changing the attention weight of the base model to α̃_{t,i,j}. Note that α̃_{t,i,j} sums to 1 across all records in all tables.
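Because β is a distribution over rows and each γ row is a distribution over that row's records, multiplying them yields a single valid distribution over all records, with no extra normalization. A numeric sketch (toy sizes and random scores):

```python
import math, random

rng = random.Random(4)
n_rows, n_cols = 3, 4  # toy: 3 rows, 4 records per row

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

beta = softmax([rng.uniform(-1, 1) for _ in range(n_rows)])    # row-level attention
gamma = [softmax([rng.uniform(-1, 1) for _ in range(n_cols)])  # record-level,
         for _ in range(n_rows)]                               # per-row normalized

# Re-weight each row's record attention by that row's weight.
alpha = [[beta[i] * gamma[i][j] for j in range(n_cols)] for i in range(n_rows)]
total = sum(sum(r) for r in alpha)
```

Since Σ_j γ_{t,i,j} = 1 for every row i, Σ_{i,j} α̃_{t,i,j} = Σ_i β_{t,i} = 1, matching the claim in the text.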

Training
Given a batch of input tables {S}^G and reference outputs {Y}^G, we use the negative log-likelihood as the loss function and train the model by minimizing L = -Σ_{g=1}^{G} Σ_{t=1}^{T_g} log P(y^{(g)}_t | y^{(g)}_{<t}, S^{(g)}), where G is the number of examples in the batch and T_g is the length of the g-th reference.
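For concreteness, the loss sums the negative log-probabilities of the gold tokens over every timestep of every reference in the batch. A toy sketch with made-up per-step probabilities:

```python
import math

# Toy per-step probabilities the model assigns to the gold tokens of two
# references in one batch (lengths T_1 = 3 and T_2 = 2); values are invented.
batch = [[0.9, 0.7, 0.8],
         [0.6, 0.95]]

# L = -sum over examples g and steps t of log P(y_t^(g) | y_<t^(g), S^(g))
loss = -sum(math.log(p) for ref in batch for p in ref)
```

Higher probabilities on the gold tokens drive the loss toward zero; any step with low probability dominates the sum.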

Dataset and Evaluation Metrics
We conducted experiments on ROTOWIRE (Wiseman et al., 2017). Each example provides three tables, as described in Section 2.1, containing 628 records in total, together with a long game summary; the average summary length is 337.1 words. We followed the data split introduced in Wiseman et al. (2017).

Implementation Details
Following the configuration of Puduppully et al. (2019), we set the word embedding and LSTM decoder hidden size to 600, and used a 2-layer decoder. Input feeding (Luong et al., 2015) was used for the decoder, and we applied dropout at a rate of 0.3. For training, we used the Adagrad (Duchi et al., 2010) optimizer with a learning rate of 0.15, truncated BPTT (block length 100), a batch size of 5 and a learning rate decay of 0.97. For inference, we set the beam size to 5. We set the history window size to 3, chosen from {3, 5, 7} based on the results. Code for our model can be found at https://github.com/ernestgong/data2text-three-dimensions/.

Automatic Evaluation

Table 1 displays the automatic evaluation results on both the development and test sets. We chose the Conditional Copy (CC) model as our baseline, which is the best model in Wiseman et al. (2017). We include both the scores reported with the updated IE model by Puduppully et al. (2019) and our own implementation's results for CC. We also compared our model with other existing work on this dataset, including OpATT (Nie et al., 2018) and Neural Content Planning with conditional copy (NCP+CC) (Puduppully et al., 2019). In addition, we implemented three other hierarchical encoders that encode the tables' row dimension information at both the record level and the row level, to compare with the hierarchical structure of our encoder; their decoders were equipped with dual attention (Cohan et al., 2018). The one with an LSTM cell is similar to Cohan et al. (2018), with 1 layer chosen from {1, 2, 3}. The one with a CNN cell (Gehring et al., 2017) has kernel width 3, chosen from {3, 5}, and 10 layers, chosen from {5, 10, 15, 20}. The one with a transformer-style encoder (MHSA) (Vaswani et al., 2017) has 8 heads, chosen from {8, 10}, and 5 layers, chosen from {2, 3, 4, 5, 6}. The head and layer settings above apply to the record-level and row-level encoders respectively. The self-attention (SA) cell we used, as described in Section 3, achieved better overall performance among the hierarchical encoders in terms of CS F1%, CO and BLEU. We also implemented a template system, identical to the one used in Wiseman et al. (2017), which outputs eight sentences: an introductory sentence (the two teams' points and who won), statistics for the six top players (ranked by points) and a concluding sentence. We refer readers to Wiseman et al. (2017)'s paper for more details on the templates. The gold reference's results are also included in Table 1. Overall, our model performs better than the other neural models on both the development and test sets in terms of RG P%, CS F1%, CO and BLEU, indicating a clear improvement in generating high-fidelity, informative and fluent texts. Our model with three dimension representations also outperforms the hierarchical encoders with only the row dimension representation on the development set, indicating that the column and time dimension representations are important for representing the tables. Compared to the baseline result reported in Wiseman et al. (2017), we achieve improvements of 22.27% in RG, 26.84% in CS F1%, 35.28% in CO and 18.75% in BLEU on the test set. Unsurprisingly, the template system achieves the best RG P% and CS R% due to its built-in domain knowledge. The high RG # and low CS P% also indicate that the template includes vast amounts of information, much of it redundant. In addition, the low CO and low BLEU indicate that the rigid structure of the template produces texts that are less adaptive to the given tables and less natural than those produced by neural models. We also conducted an ablation study on the development set to evaluate each component's contribution. Based on the results, removing the row-level encoder hurts our model's performance across all metrics, especially its content selection ability.
Row, column and time dimension information are all important for modeling the tables, since removing any of them results in a performance drop. The position embedding is also critical for modeling time dimension information. In addition, the record fusion gate plays an important role: BLEU, CO, RG P% and CS P% drop significantly after removing it from the full model. These results show that each component of the model contributes to the overall performance. In addition, we compare our model with the delayed copy model (DEL) (Li and Wan, 2018), alongside the gold texts, the template system (TEM), conditional copy (CC) (Wiseman et al., 2017) and NCP+CC (NCP) (Puduppully et al., 2019); we include DEL's result as reported in Li and Wan (2018) for comparison. Li and Wan (2018)'s model first generates a template and then fills in the slots with a delayed copy mechanism. Since their results were evaluated with the IE model trained by Wiseman et al. (2017) and the "relexicalization" of Li and Wan (2018), we adopted the corresponding IE model and re-implemented the "relexicalization" as suggested by Li and Wan (2018) for a fair comparison. Note that CC's evaluation results under our re-implemented "relexicalization" are comparable to those reported in Li and Wan (2018). We applied this evaluation to all models other than DEL, as shown in Table 2, and report DEL's result from Li and Wan (2018)'s paper. Our model outperforms Li and Wan (2018)'s model significantly across all automatic evaluation metrics in Table 2.

Human Evaluation
In this section, we describe the human evaluation, for which we hired three graduates who had passed an intermediate English test (College English Test Band 6) and were familiar with NBA games.
First, to check whether history information is important, we sampled 100 summaries from the training set and asked the raters to manually check whether each summary contained expressions that need to be inferred from history information. It turns out that 56.7% of the sampled summaries require history information.
Following the human evaluation settings in Puduppully et al. (2019), we conducted the following experiments at the same scale. The second experiment assesses whether the improvement on the relation generation metric reported in the automatic evaluation is supported by human evaluation. We compared our full model with the gold texts, the template-based system, CC (Wiseman et al., 2017) and NCP+CC (NCP) (Puduppully et al., 2019). We randomly sampled 30 examples from the test set, and then randomly sampled 4 sentences from each model's output for each example. We provided the raters with the corresponding NBA game statistics and asked them to count the number of supporting and contradicting facts in each sentence; each sentence was rated independently. We report the average number of supporting facts (#Sup) and contradicting facts (#Cont) in Table 3. Unsurprisingly, the template-based system includes the most supporting facts and the fewest contradicting facts, because the template consists of a large number of facts, all extracted from the table. Our model produces fewer contradicting facts than the other two neural models. Although it produces fewer supporting facts than NCP and CC, it still includes enough of them (slightly more than the gold texts). Moreover, compared to NCP+CC (NCP)'s tendency to include vast amounts of information, much of it redundant, our model selects and conveys information more accurately. All other results (Gold, CC, NCP and ours) are significantly different from the template-based system's in terms of the number of supporting facts, according to one-way ANOVA with post-hoc Tukey HSD tests; all significance levels reported in this paper are below 0.05. Our model is also significantly different from the NCP model. As for the average number of contradicting facts, our model is significantly different from the other two neural models.
Surprisingly, the gold texts were found to contain contradicting facts. We checked the raters' results and found that the gold texts occasionally include a wrong field-goal or three-point percentage, or a wrong point difference between the winning and losing teams. We can therefore treat the gold texts' average number of contradicting facts as a lower bound.
In the third experiment, following Puduppully et al. (2019), we asked the raters to evaluate the models in terms of grammaticality (is it more fluent and grammatical?), coherence (is it easier to read, or does it follow a more natural ordering of facts?) and conciseness (does it avoid redundant information and repetition?). We adopted the same 30 examples as above and arranged each 5-tuple of summaries into 10 pairs. The raters were then asked to choose the better system in each pair. Scores are computed as the difference between the percentage of times a model is chosen as the best and the percentage of times it is chosen as the worst. The gold texts score significantly higher than the other systems across all three metrics. Our model also performs significantly better than the other two neural models (CC, NCP) on all three metrics. The template-based system generates significantly more grammatical and concise but significantly less coherent results than all three neural models: its rigid structure ensures correct grammar and avoids repetition, but because the templates are stilted and lack variability, the raters deemed its output less coherent.

Qualitative Example
Our model: The Charlotte Hornets ( 21 -27 ) defeated the Washington Wizards ( 31 -18 ) 92 -88 on Monday … The Hornets were led by Al Jefferson , who recorded a double -double of his own with 18 points ( 9 -19 FG , 0 -2 FT ) and 12 rebounds . It was his second double -double over his last three games … The only other Wizard to reach double -digit points was Kris Humphries , who came off the bench for 13 points ( 4 -8 FG , 5 -6 FT ) and five rebounds in 26 minutes …

Figure 3: A generation example from our model, based on the same tables as Figure 1. Text that accurately reflects the players' (Al Jefferson's and Kris Humphries's) performance is in red.

Figure 3 shows an example generated by our model. It evidently has several nice properties. It accurately selects the important player "Al Jefferson" from the tables, who is neglected by the baseline model; this requires the model to understand performance differences for one type of data (column) across rows (players). It also correctly summarizes the performance of "Al Jefferson" in this match as a "double-double", which requires the ability to capture dependencies between different columns (different record types) in the same row (player). In addition, it models "Al Jefferson"'s historical performance and correctly states that "It was his second double-double over his last three games", which is also mentioned in a similar way in the gold texts in Figure 1.

Related Work
In recent years, neural data-to-text systems have made remarkable progress in generating texts directly from data. Mei et al. (2016) propose an encoder-aligner-decoder model to generate weather forecasts, while Jain et al. (2018) propose a mixed hierarchical attention. Other work proposes a hybrid content- and linkage-based attention mechanism to model the order of content, and to integrate field information into the table representation while enhancing the decoder with dual attention. Bao et al. (2018) develop a table-aware encoder-decoder model. Wiseman et al. (2017) introduced a document-scale data-to-text dataset consisting of long texts with many redundant records, which requires the model to select the important information to generate; we describe recent work on it in Section 1. Some studies in abstractive text summarization also encode long texts hierarchically. Cohan et al. (2018) use a hierarchical encoder to encode the input, paired with a discourse-aware decoder. Ling and Rush (2017) encode documents hierarchically and propose coarse-to-fine attention for the decoder. Recently, Liu et al. (2019) proposed a hierarchical encoder for data-to-text generation that uses an LSTM as its cell. Murakami et al. (2017) propose to model stock market time-series data and generate comments. As for incorporating historical background into generation, Robin (1994) proposed building a draft with the essential new facts first, then incorporating background facts when revising the draft based on functional unification grammars. In contrast, we encode the historical (time dimension) information in a neural data-to-text model in an end-to-end fashion. Existing work on data-to-text generation neglects the joint representation of tables' row, column and time dimension information. In this paper, we propose an effective hierarchical encoder that models information from the row, column and time dimensions simultaneously.

Conclusion
In this work, we present an effective hierarchical encoder for table-to-text generation that learns table representations from the row, column and time dimensions. In detail, our model consists of three layers, which learn records' representations in the three dimensions, combine those representations according to their saliency, and obtain row-level representations from the record representations. During decoding, the model selects important table rows before attending to records. Experiments are conducted on ROTOWIRE, a benchmark dataset of NBA games. Both automatic and human evaluation results show that our model achieves new state-of-the-art performance.