Learning to Select, Track, and Generate for Data-to-Text

We propose a data-to-text generation model with two modules, one for tracking and the other for text generation. The tracking module selects and keeps track of salient information and memorizes which records have been mentioned. The generation module generates a summary conditioned on the state of the tracking module. Our model can be viewed as simulating the human writing process, which gradually selects information by determining intermediate variables while writing the summary. In addition, we explore the effectiveness of writer information for generation. Experimental results show that our model outperforms existing models on all evaluation metrics even without writer information; incorporating writer information further improves performance, contributing to both content planning and surface realization.


Introduction
Advances in sensor and data storage technologies have rapidly increased the amount of data produced in various fields such as weather, finance, and sports. To address the information overload caused by this massive data, data-to-text generation technology, which expresses the contents of data in natural language, is becoming more important (Barzilay and Lapata, 2005). Recently, neural methods have been able to generate high-quality short summaries, especially from small pieces of data.
Despite this success, it remains challenging to generate a high-quality long summary from data (Wiseman et al., 2017). One reason for the difficulty is that the input data are too large for a naive model to find their salient parts, i.e., to determine which parts of the data should be mentioned.
In addition, the salient part moves as the summary explains the data. For example, when generating a summary of a basketball game (Table 1 (b)) from the box score (Table 1 (a)), the input contains numerous data records about the game, e.g., that Jordan Clarkson scored 18 points. Existing models often refer to the same data record multiple times (Puduppully et al., 2019). They may also mention an incorrect data record, e.g., "Kawhi Leonard added 19 points" when the summary should instead mention LaMarcus Aldridge, who scored 19 points. Thus, we need a model that finds salient parts, tracks transitions between salient parts, and expresses information faithful to the input.
In this paper, we propose a novel data-to-text generation model with two modules, one for saliency tracking and the other for text generation. The tracking module keeps track of saliency in the input data: when it detects a saliency transition, it selects a new data record and updates its state. The text generation module generates a document conditioned on the current tracking state. Our model can be viewed as imitating the human writing process, which gradually selects and tracks the data while generating the summary. In addition, we note writer-specific patterns and characteristics: how data records are selected to be mentioned, and how they are expressed as text, e.g., the order of data records and word usage. We therefore also incorporate writer information into our model.
The experimental results demonstrate that, even without writer information, our model achieves the best performance among the previous models in all evaluation metrics: 94.38% precision of relation generation, 42.40% F1 score of content selection, 19.38% normalized Damerau-Levenshtein Distance (DLD) of content ordering, and 16.15% of BLEU score. We also confirm that adding writer information further improves the performance.
Related Work

Neural generation methods have become the mainstream approach for data-to-text generation. The encoder-decoder framework (Sutskever et al., 2014) with attention (Bahdanau et al., 2015; Luong et al., 2015) and copy mechanisms (Gu et al., 2016; Gulcehre et al., 2016) has been successfully applied to data-to-text tasks. However, neural generation methods sometimes yield fluent but inadequate descriptions (Tu et al., 2017). In data-to-text generation, descriptions inconsistent with the input data are particularly problematic.
Recently, Wiseman et al. (2017) introduced the ROTOWIRE dataset, which contains multi-sentence summaries of basketball games with box-scores (Table 1). This dataset requires selecting a salient subset of data records for generating descriptions. They also proposed automatic evaluation metrics for measuring the informativeness of generated summaries. Puduppully et al. (2019) proposed a two-stage method that first predicts the sequence of data records to be mentioned and then generates a summary conditioned on the predicted sequence. Their idea is similar to ours in that both consider a sequence of data records as a content plan. However, our proposal differs from theirs in that ours uses a recurrent neural network for saliency tracking, and our decoder dynamically chooses the data record to be mentioned without fixing a sequence of data records in advance.

Memory modules
Memory networks can be used to maintain and update representations of salient information (Sukhbaatar et al., 2015; Graves et al., 2016). Such modules are often used in natural language understanding to keep track of entity states (Kobayashi et al., 2016; Hoang et al., 2018; Bosselut et al., 2018).
Recently, entity tracking has become popular for generating coherent text (Kiddon et al., 2016; Ji et al., 2017; Clark et al., 2018). Kiddon et al. (2016) proposed a neural checklist model that updates predefined item states. Ji et al. (2017) proposed entity representations for language modeling; their method selects the salient entity and updates its tracking state when the entity is introduced.
Our model extends this entity tracking module to data-to-text generation tasks. The entity tracking module selects the salient entity and the appropriate attribute at each time step, updates their states, and generates coherent summaries from the selected data records.

Data
Through careful examination, we found that in the original ROTOWIRE dataset, some NBA games have two documents, one of which is sometimes in the training data while the other is in the test or validation data. Such documents are similar to each other, though not identical. To make the dataset more reliable for experiments, we created a new version.
We ran the script provided by Wiseman et al. (2017) for crawling the ROTOWIRE website for NBA game summaries. The script collected approximately 78% of the documents in the original dataset; the remaining documents had disappeared. We also collected the box-scores associated with the collected documents and observed that some of them had been modified compared with the original ROTOWIRE dataset.
The collected dataset contains 3,752 instances (i.e., pairs of a document and box-scores). However, the four shortest documents were not summaries; they were, for example, announcements about the postponement of a match. We thus deleted these 4 instances and were left with 3,748 instances. We followed the dataset split of Wiseman et al. (2017) to divide our dataset into training, development, and test data. We found 14 instances that did not have corresponding instances in the original data and randomly assigned 9, 2, and 3 of them to the training, development, and test data, respectively. Finally, the sizes of our training, development, and test datasets are respectively 2,714, 534, and 500. On average, each summary has 384 tokens and 644 data records. As far as we could verify, each match has only one summary in our dataset. We also collected the writer of each document. Our dataset contains 32 different writers; the most prolific writer wrote 607 documents, while some writers wrote fewer than ten. On average, each writer wrote 117 documents. We call our new dataset ROTOWIRE-MODIFIED.

Table 2: Running example of our model's generation process. At every time step t, the model predicts each random variable. The model first determines whether to refer to data records (Z_t = 1) or not (Z_t = 0). If Z_t = 1, the model selects entity E_t, its attribute A_t, and the binary variable N_t if needed. For example, at t = 202, the model predicts Z_202 = 1 and then selects the entity JABARI PARKER and its attribute PLAYER_PTS. Given these values, the model outputs the token 15 from the selected data record.

Saliency-Aware Text Generation
At the core of our model is a neural language model with a memory state h^LM that generates a summary y_{1:T} = (y_1, ..., y_T) given a set of data records x. Our model has another memory state h^ENT, which is used to remember the data records that have been referred to; h^ENT is also used to update h^LM, so that the referred data records affect text generation. Our model decides whether to refer to x, which data record r ∈ x to mention, and how to express a number. The selected data record is used to update h^ENT. Formally, we use four random variables:
1. Z_t: a binary variable that determines whether the model refers to the input x at time step t (Z_t = 1).
2. E_t: the salient entity at time step t (e.g., HAWKS, LEBRON JAMES).
3. A_t: the salient attribute to be mentioned at time step t (e.g., PTS).
4. N_t: if attribute A_t of the salient entity E_t is numeric, this variable determines whether the value should be output in Arabic numerals (e.g., 50) or in English words (e.g., five).
To keep track of the salient entity, our model predicts these random variables at each time step t throughout the summary generation process. A running example is shown in Table 2, and the full algorithm is described in Appendix A. In the following subsections, we explain how to initialize the model, predict these random variables, and generate a summary. Due to space limitations, bias vectors are omitted.
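As an illustration, the per-time-step decision process above can be sketched as follows. This is a hypothetical, highly simplified rendering of the control flow in Table 2: the scoring functions (`choose_z`, `choose_entity`, `choose_attribute`, `choose_numeral`, `emit_word`) stand in for the model's neural modules and are passed in as plain callables.

```python
# Hypothetical sketch of one decoding step: decide Z_t, then
# (E_t, A_t, N_t) if Z_t = 1, otherwise emit an ordinary word.
def generate_step(state, choose_z, choose_entity, choose_attribute,
                  choose_numeral, emit_word):
    if choose_z(state):                        # Z_t = 1: refer to a record
        e_t = choose_entity(state)             # select salient entity E_t
        a_t = choose_attribute(state, e_t)     # select its attribute A_t
        n_t = choose_numeral(state, e_t, a_t)  # numeral vs. word, if numeric
        return ("record", e_t, a_t, n_t)
    return ("word", emit_word(state))          # Z_t = 0: plain generation
```

The tuple returned here is only a placeholder for what the real model does with the selected record (copying its value and updating the tracking state).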
Before explaining our method, we describe our notation. Let E and A denote the sets of entities and attributes, respectively. Each record r ∈ x consists of an entity e ∈ E, an attribute a ∈ A, and its value x[e, a], and is therefore represented as r = (e, a, x[e, a]). For example, the box-score in Table 1 has a record r such that e = ANTHONY DAVIS, a = PTS, and x[e, a] = 20.
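This notation can be made concrete with a small sketch. The `Record` type, the `lookup` helper, and the toy box-score fragment below are illustrative assumptions, not the dataset's actual format.

```python
from collections import namedtuple

# A record r = (e, a, x[e, a]) as a named tuple.
Record = namedtuple("Record", ["entity", "attribute", "value"])

# A tiny box-score fragment x, following the example in the text.
x = [
    Record("ANTHONY DAVIS", "PTS", "20"),
    Record("ANTHONY DAVIS", "REB", "11"),
    Record("JABARI PARKER", "PLAYER_PTS", "15"),
]

def lookup(records, e, a):
    """Return x[e, a], the value of attribute a for entity e (or None)."""
    for r in records:
        if r.entity == e and r.attribute == a:
            return r.value
    return None
```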

Initialization
Let r denote the embedding of data record r ∈ x, and let ē denote the embedding of entity e. Note that ē depends on the set of data records, i.e., on the game. We also use e for the static embedding of entity e, which, in contrast, does not depend on the game.
Given the embeddings of entity e, attribute a, and its value v, we use a concatenation layer to combine the information from these vectors and produce the embedding of each data record (e, a, v), denoted r_{e,a,v}:

r_{e,a,v} = W^R (e ⊕ a ⊕ v),

where ⊕ indicates the concatenation of vectors and W^R denotes a weight matrix. We obtain ē for entity e in the set of data records x by summing all of its data-record embeddings transformed by an attribute-specific matrix:

ē = Σ_{(e,a,v) ∈ x} W^A_a r_{e,a,v},

where W^A_a is a weight matrix for attribute a and the sum runs over the records of entity e. Since ē depends on the game as above, ē is supposed to represent how entity e played in the game.
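A minimal NumPy sketch of these two embedding computations. The lookup tables, the dimensionality `d`, and the random initialization are illustrative assumptions; only the concatenate-project and transform-sum structure follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding size (illustrative)

# Assumed static embeddings for entities, attributes, and values.
emb = {name: rng.standard_normal(d)
       for name in ["ANTHONY DAVIS", "PTS", "REB", "20", "11"]}

W_R = rng.standard_normal((d, 3 * d))                 # record projection W^R
W_A = {a: rng.standard_normal((d, d)) for a in ["PTS", "REB"]}  # W^A_a

def record_embedding(e, a, v):
    """r_{e,a,v} = W^R (e ⊕ a ⊕ v): concatenate the three embeddings,
    then project with the weight matrix."""
    return W_R @ np.concatenate([emb[e], emb[a], emb[v]])

def dynamic_entity_embedding(records):
    """ē: sum of attribute-transformed record embeddings for one entity."""
    return sum(W_A[a] @ record_embedding(e, a, v) for (e, a, v) in records)

e_bar = dynamic_entity_embedding([("ANTHONY DAVIS", "PTS", "20"),
                                  ("ANTHONY DAVIS", "REB", "11")])
```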
To initialize the hidden state of each module, we use the embedding of <SOD> for h^LM and the average of the embeddings ē for h^ENT.

Saliency transition
Generally, the saliency of text changes during text generation. In this work, we suppose that the saliency is represented by the entity and attribute being talked about. We therefore propose a model that refers to a data record at each time step and transitions to another as the text proceeds.
To determine whether to transition to another data record at time step t, the model calculates the probability p(Z_t = 1 | h^LM_{t-1}) from the current state of the language model. When this probability is high, the model transitions to another data record.
When the model decides to transition, it then determines which entity and attribute to refer to and generates the next word (Section 4.3). Otherwise, the model generates the next word without updating the tracking states.
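The transition decision can be sketched as a sigmoid gate over the language-model state. The weight vector `w_z` is a hypothetical parameter for illustration; the text does not specify the exact parameterization.

```python
import numpy as np

def transition_probability(h_lm, w_z):
    """A sketch of p(Z_t = 1 | h^LM_{t-1}): squash a score of the
    language-model state through a sigmoid; w_z is an assumed weight
    vector, not the paper's exact parameterization."""
    return 1.0 / (1.0 + np.exp(-float(w_z @ h_lm)))
```

At decoding time, a high value would trigger the selection-and-tracking step of the next section; a low value continues plain word generation.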

Selection and tracking
When the model refers to a new data record (Z_t = 1), it selects an entity and its attribute. It also tracks the saliency by putting information about the selected entity and attribute into the memory vector h^ENT. The model selects the subject entity and updates the memory states when the subject entity changes. Specifically, the model first calculates the probability p(E_t = e | ·) of selecting each entity e, where E_{t-1} is the set of entities that have already been referred to by time step t, and s = max{s' : s' ≤ t − 1, e_{s'} = e} indicates the time step at which entity e was last mentioned. The model selects the most probable entity as the next salient entity and updates the set of entities that have appeared (E_t = E_{t-1} ∪ {e_t}).
If the salient entity changes (e_t ≠ e_{t-1}), the model updates the hidden state of the tracking model h^ENT with a recurrent neural network with a gated recurrent unit (GRU; Chung et al., 2014):

h^ENT_t = GRU(ē_t, h^ENT_{t-1})      if e_t ∉ E_{t-1},
h^ENT_t = GRU(h^ENT_s, h^ENT_{t-1})  if e_t ∈ E_{t-1} and e_t ≠ e_{t-1}.

Note that if the selected entity e_t is identical to the previously selected entity e_{t-1}, the hidden state of the tracking model is not updated. If the selected entity e_t is new (e_t ∉ E_{t-1}), the hidden state is updated with the embedding ē of entity e_t as input. In contrast, if entity e_t has already appeared (e_t ∈ E_{t-1}) but is not identical to the previous one (e_t ≠ e_{t-1}), we use h^ENT_s (i.e., the memory state when this entity last appeared) to fully exploit the local history of this entity.
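The three-way case analysis above can be summarized in code. The `toy_gru` stand-in (scalars instead of vectors) replaces the real learned GRU cell purely so the control flow is runnable; the function signature and names are assumptions for illustration.

```python
def update_tracking_state(e_t, e_prev, seen, h_ent, last_state, entity_emb, gru):
    """Update h^ENT after selecting entity e_t:
    - e_t equals the previous entity: keep the state as-is;
    - e_t is brand-new: feed its dynamic embedding ē to the GRU;
    - e_t was mentioned before: feed the memory state saved when it
      last appeared (h^ENT_s) to exploit its local history."""
    if e_t == e_prev:
        return h_ent
    if e_t not in seen:
        return gru(entity_emb[e_t], h_ent)
    return gru(last_state[e_t], h_ent)

# Toy stand-in for the GRU cell, just to make the control flow runnable.
toy_gru = lambda x, h: 0.5 * (x + h)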
Given the updated hidden state of the tracking model h^ENT_t, we next select the attribute of the salient entity according to the probability p(A_t = a | h^ENT_t). After selecting a_t, i.e., the most probable attribute of the salient entity, the tracking model updates the memory state h^ENT_t with the embedding of the data record r_{e_t, a_t, x[e_t, a_t]} introduced in Section 4.1 as input.

Summary generation
Given the two hidden states, one for the language model, h^LM_{t-1}, and the other for the tracking model, h^ENT_t, the model generates the next word y_t. We also incorporate a copy mechanism that copies the value of the salient data record x[e_t, a_t].
If the model refers to a new data record (Z_t = 1), it directly copies the value of the data record x[e_t, a_t]. However, the values of numeric attributes can be expressed in at least two manners: Arabic numerals (e.g., 14) and English words (e.g., fourteen). We decide which one to use according to the probability p(N_t | ·), computed with a weight matrix W^N. The model then constructs the context vector used to update the language model:

h_t = tanh(W^H (h^LM_{t-1} ⊕ h^ENT_t)),

where W^H is a weight matrix. If the salient data record is the same as the previous one (Z_t = 0), the model predicts the next word y_t via a probability over words conditioned on the context vector h_t. Subsequently, the hidden state of the language model h^LM is updated with the embedding y_t of the word generated at time step t as input.
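The Arabic-numeral versus English-word choice for a copied value can be sketched as follows. The 0.5 decision threshold and the small number-word table are assumptions for illustration; the model actually learns this choice via N_t.

```python
# A tiny, assumed number-word table; the real model covers all values.
NUM_WORDS = {2: "two", 5: "five", 11: "eleven", 14: "fourteen", 15: "fifteen"}

def realize_value(value, p_arabic):
    """Copy the salient value x[e_t, a_t]; the binary choice N_t decides
    between Arabic numerals and English words (0.5 threshold assumed)."""
    if p_arabic > 0.5 or value not in NUM_WORDS:
        return str(value)
    return NUM_WORDS[value]
```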

Incorporating writer information
We also incorporate information about the writer of each summary into our model. Specifically, instead of using Equation (9), we concatenate the embedding w of the writer with h^LM_{t-1} ⊕ h^ENT_t to construct the context vector h_t:

h_t = tanh(W^H (h^LM_{t-1} ⊕ h^ENT_t ⊕ w)),

where W^H is a new weight matrix. Since this context vector h_t is used for calculating the probability over words in Equation (10), the writer information directly affects word generation, which corresponds to surface realization in traditional text generation. Simultaneously, the context vector h_t enhanced with the writer information is used to obtain h^LM_t, the hidden state of the language model, which is further used to select the salient entity and attribute, as described in Sections 4.2 and 4.3. Therefore, in our model, the writer information affects both surface realization and content planning.
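A small NumPy sketch of this writer-aware context vector. The dimensionality and the randomly initialized weight matrix are illustrative assumptions; the point is that the writer embedding joins the concatenation before the projection, so it influences every generated word.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                     # illustrative dimensionality
W_H = rng.standard_normal((d, 3 * d))     # assumed new weight matrix W^H

def context_vector(h_lm, h_ent, w_writer):
    """h_t = tanh(W^H (h^LM_{t-1} ⊕ h^ENT_t ⊕ w))."""
    return np.tanh(W_H @ np.concatenate([h_lm, h_ent, w_writer]))

h_t = context_vector(rng.standard_normal(d), rng.standard_normal(d),
                     rng.standard_normal(d))
```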

Learning objective
We apply fully supervised training that maximizes the log-likelihood log p(Y_{1:T}, Z_{1:T}, E_{1:T}, A_{1:T}, N_{1:T} | x). In our initial experiments, we observed a word repetition problem when the tracking model was not updated while generating a sentence. To avoid this problem, we also update the tracking model with a special trainable vector v_REFRESH to refresh its state after the model generates a period.

Table 3: Experimental results. Each metric evaluates whether important information (CS) is described accurately (RG) and in the correct order (CO).

Experimental settings
We used ROTOWIRE-MODIFIED, described in Section 3, as the dataset for our experiments. The training, development, and test data contained 2,714, 534, and 500 games, respectively.
Since we take a supervised training approach, we need the annotations of the random variables (i.e., Z t , E t , A t , and N t ) in the training data, as shown in Table 2. Instead of simple lexical matching with r ∈ x, which is prone to errors in the annotation, we use the information extraction system provided by Wiseman et al. (2017). Although this system is trained on noisy rule-based annotations, we conjecture that it is more robust to errors because it is trained to minimize the marginalized loss function for ambiguous relations. All training details are described in Appendix B.

Models to be compared
We compare our model against two baseline models: the model of Wiseman et al. (2017) and that of Puduppully et al. (2019), which generates a summary conditioned on a subset of all data records predicted in its first stage. Unlike these models, our model uses one memory vector h^ENT_t that tracks the history of the data records during generation. We retrained the baselines on our new dataset. We also present the performance of the GOLD and TEMPLATES summaries. The GOLD summary is identical to the reference summary, and each TEMPLATES summary is generated in the same manner as in Wiseman et al. (2017). Our code is available at https://github.com/aistairc/sports-reporter.
In the latter half of our experiments, we examine the effect of adding information about writers. In addition to our model enhanced with writer information, we also add writer information to the model of Puduppully et al. (2019). Their method consists of two stages corresponding to content planning and surface realization. Therefore, by incorporating writer information into each of the two stages, we can clearly see which part of the model the writer information contributes to. For Puduppully et al. (2019)'s model, we attach the writer information in the following three ways:
1. concatenating the writer embedding w with the input vector for the LSTM in the content planning decoder (stage 1);
2. concatenating the writer embedding w with the input vector for the LSTM in the text generator (stage 2);
3. using both 1 and 2 above.
For more details about each decoding stage, readers can refer to Puduppully et al. (2019).

Evaluation metrics
As evaluation metrics, we use the BLEU score (Papineni et al., 2002) and the extractive metrics proposed by Wiseman et al. (2017): relation generation (RG), content selection (CS), and content ordering (CO). The extractive metrics measure how well the relations extracted from the generated summary match the correct relations [6]:
- RG: the ratio of correct relations among all extracted relations, where correct relations are those found in the input data records x. The average number of extracted relations is also reported.
- CS: precision and recall of the relations extracted from the generated summary against those from the reference summary.
- CO: edit distance, measured with normalized Damerau-Levenshtein distance (DLD), between the sequences of relations extracted from the generated and reference summaries.
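For concreteness, the core of the CO metric can be sketched as follows. This uses the optimal-string-alignment variant of DLD and one common normalization by the longer sequence length; the cited implementation may differ in detail.

```python
def dld(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant)
    between two sequences of relations."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

def normalized_dld(gen, ref):
    """CO-style score: 1 - DLD / max length, so higher is better
    (an assumed normalization for illustration)."""
    if not gen and not ref:
        return 1.0
    return 1.0 - dld(gen, ref) / max(len(gen), len(ref))
```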

Results and Discussions
We first focus on the quality of tracking model and entity representation in Sections 6.1 to 6.4, where we use the model without writer information. We examine the effect of writer information in Section 6.5.

Saliency tracking-based model
As shown in Table 3, our model outperforms all baselines across all evaluation metrics [7]. One noticeable result is that our model achieves slightly higher RG precision than the gold summary. Owing to the extractive nature of the evaluation, a generated summary can beat the gold summary in RG precision; indeed, the template model achieves 100% RG precision. Another is that only our model exceeds the template model in the F1 score of content selection, and it obtains the highest content ordering performance. This implies that the tracking model encourages the selection of salient input records in the correct order.

Qualitative analysis of entity embedding
Our model has the dynamic entity embedding ē, which depends on the box score of each game, in addition to the static entity embedding e. We now analyze the difference between these two types of embeddings.
We present two-dimensional visualizations of both embeddings, produced using PCA (Pearson, 1901). As shown in Figure 1, which visualizes the static entity embeddings e, the top-ranked players are closely located.

[6] Relations consist of entity mentions extracted from the summaries and the corresponding attributes (e.g., "TEAM NAME", "PTS") found in the box- or line-score. The precision and the recall of this extraction model are respectively 93.4% and 75.0% on the test data.

[7] The scores of Puduppully et al. (2019)'s model dropped significantly from what they reported, especially on the BLEU metric. We speculate that this is mainly due to the reduced amount of our training data (Section 3); that is, their model might be more data-hungry than the other models.
We also present visualizations of the dynamic entity embeddings ē in Figure 2. Although we did not carry out NBA-specific feature engineering (e.g., whether a player scored double digits) for representing the dynamic entity embedding ē, the embeddings of the players who performed well in a given game have similar representations. In addition, we observed that the embedding of the same player changes depending on the box-scores of each game. For instance, LeBron James recorded a double-double in a game on April 22, 2016. For this game, his embedding is located close to that of Kevin Love, who also scored a double-double. However, he did not participate in the game on December 26, 2016, and his embedding for this game is closer to those of other players who also did not participate.

Duplicate ratios of extracted relations
As Puduppully et al. (2019) pointed out, a generated summary may mention the same relation multiple times. Such duplicated relations are not favorable in terms of the brevity of text. Figure 3 shows the ratios of generated summaries with duplicate mentions of relations in the development data. While the models by Wiseman et al. (2017) and Puduppully et al. (2019) respectively showed duplicate ratios of 36.0% and 15.8%, our model exhibited 4.2%. This suggests that our model dramatically suppresses the generation of redundant relations; we speculate that the tracking model successfully memorizes which input records have been selected in h^ENT_s. For example, in the output of Puduppully et al. (2019)'s model, the description of DERRICK ROSE's relations, "15 points, four assists, three rebounds and one steal in 33 minutes," is also used for other entities (e.g., JOHN HENSON and WILLY HERNANGOMEZ). This is because their model, unlike ours, has no tracking module; our tracking module mitigates such redundant references, so our summaries rarely contain these erroneous relations.
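The duplicate ratio reported above can be computed with a simple sketch. Representing each extracted relation as a hashable tuple is an assumption for illustration.

```python
def duplicate_ratio(summaries_relations):
    """Fraction of summaries that mention at least one relation twice,
    given a list of per-summary relation lists."""
    def has_duplicate(relations):
        return len(relations) != len(set(relations))
    flagged = sum(1 for rels in summaries_relations if has_duplicate(rels))
    return flagged / len(summaries_relations)
```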

Qualitative analysis of output examples
When complicated expressions such as parallel structures are used, however, our model also generates erroneous relations, as illustrated by the underlined sentences describing the two players who scored the same number of points. For example, "11-point efforts" is correct for COURTNEY LEE but not for DERRICK ROSE. As future work, it is necessary to develop a method that can handle such complicated relations.

Use of writer information
We first look at the results of the extension of Puduppully et al. (2019)'s model with writer information w in Table 4. Adding w to content planning (stage 1) improved CS (37.60 to 47.25), CO (16.97 to 22.16), and BLEU (13.96 to 18.18). Adding w to the surface realization component (stage 2) improved BLEU (13.96 to 17.81), while the effects on the other metrics were not significant. Adding w to both stages yielded the highest BLEU, while the other metrics changed little from adding w to stage 1 alone. These results suggest that writer information contributes to both content planning and surface realization when properly used, and that improvements in content planning lead to much better surface realization.
Our model showed improvements in most metrics and showed the best performance by incor-  in the first quarter. However, New York by just a 25-foot lead at the end of the first quarter, the Bucks were able to pull away, as they outscored the Knicks by a 59-46 margin into the second. 45 points in the third quarter to seal the win for New York with the rest of the starters to seal the win. The Knicks were led by Giannis Antetokounmpo, who tallied a game-high 27 points, to go along with 13 rebounds, four assists, three blocks and a steal. The game was a crucial night for the Bucks' starting five, as the duo was the most effective shooters, as they posted Milwaukee to go on a pair of low low-wise (Carmelo Anthony) and Malcolm Brogdon. Anthony added 11 rebounds, seven assists and two steals to his team-high scoring total. Jabari Parker was right behind him with 15 points, four rebounds, three assists and a block. Greg Monroe was next with a bench-leading 18 points, along with nine rebounds, four assists and three steals. Brogdon posted 12 points, eight assists, six rebounds and a steal. Derrick Rose and Courtney Lee were next with a pair of {11 / 11} -point efforts. Rose also supplied four assists and three rebounds, while Lee complemented his scoring with three assists, a rebound and a steal. John Henson and Mirza Teletovic were next with a pair of {two / two} -point efforts. Teletovic also registered 13 points, and he added a rebound and an assist. Jason Terry supplied eight points, three rebounds and a pair of steals. The Cavs remain in last place in the Eastern Conference's Atlantic Division. They now head home to face the Toronto Raptors on Saturday night.
(b) Our model porating writer information w. As discussed in Section 4.5, w is supposed to affect both content planning and surface realization. Our experimental result is consistent with the discussion.

Conclusion
In this research, we proposed a new data-to-text generation model that produces a summary while tracking salient information, imitating a human writing process. Our model outperformed existing models on all evaluation measures. We also explored the effects of incorporating writer information into data-to-text models. With writer information, our model generated the highest-quality summaries, achieving a BLEU score of 20.84.