Operation-guided Neural Networks for High Fidelity Data-To-Text Generation

Recent neural models for data-to-text generation are mostly based on data-driven end-to-end training over encoder-decoder networks. Even though the generated texts are mostly fluent and informative, these models often generate descriptions that are not consistent with the input structured data. This is a critical issue, especially in domains that require inference or calculations over raw data. In this paper, we attempt to improve the fidelity of neural data-to-text generation by utilizing pre-executed symbolic operations. We propose a framework called Operation-guided Attention-based sequence-to-sequence network (OpAtt), with a specifically designed gating mechanism as well as a quantization module for operation results to utilize information from pre-executed operations. Experiments on two sports datasets show that our proposed method clearly improves the fidelity of the generated texts to the input structured data.


Introduction
Data-to-text generation is a classic language generation task that takes structured data (e.g., a table of statistics or a set of event records) as input, aiming at automatically producing texts that informatively, correctly and fluently describe the data (Kukich, 1983; Reiter and Dale, 1997; Angeli et al., 2010; Konstas and Lapata, 2012; Perez-Beltrachini and Gardent, 2017). Traditionally, a data-to-text generation system should pay attention to the problems of content selection (i.e., what to say) and surface realization (i.e., how to say it) (Reiter and Dale, 1997; Gatt and Krahmer, 2018). Modern neural generation systems avoid the distinction between these aspects by building on a standard encoder-decoder architecture (Sutskever et al., 2014) with an attention mechanism over the input content (Bahdanau et al., 2015) and training the whole system in an end-to-end fashion. As a result, end-to-end neural text generation has drawn increasing attention from the natural language research community (Mei et al., 2016; Lebret et al., 2016; Wiseman et al., 2017; Kiddon et al., 2016).
* Contribution during internship at Microsoft.
Table 1: An example of generated text from structured data. In this example, the winning team is not indicated explicitly, but can be inferred from the scores of the two teams. The words with straight underlining and wavy underlining are based on the facts from the input data and the results of inference, respectively.
However, a critical issue for neural text generation has been largely overlooked. In domains such as sports, finance or medical care, language generation should adhere to facts which are supported by or can be derived from the input data through analysis or inference. For instance, the sentence "Hawks edges the Heat with 95-94" describing the result of a basketball game should always conform to the original data in team names and the scoreline. More importantly, the word "edges" in the description is an inferred fact that the scores of the two competing teams are rather close, while "Hawks" is the winner that scores the slightly higher point total of "95". Since current neural models do not have special treatment for such data analytics, they are likely to generate spurious and incorrect statements. This problem has already been pointed out in recent studies (Wiseman et al., 2017). Related studies on neural program induction have shown that current neural models have difficulties in learning arithmetic operations such as addition and comparisons (Joulin and Mikolov, 2015; Neelakantan et al., 2016).
A straightforward way to improve the fidelity of neural text generation is to separate symbolic operations out from the neural models. More specifically, it is viable to pre-execute a few symbolic operations before generation, and then use the results of execution to guide the whole generation process. However, there are two major challenges for incorporating pre-defined operations: (1) if we apply operations exhaustively on all fields with compatible value types in the table, it would create a huge search space in which mention-worthy results are rare events, and (2) it is difficult to establish the correspondences between specific spans of numeric results and lexical choices. For example, the word "edges" corresponds to the slight difference in score, i.e., 1, in Table 1.
Inspired by recent work that separates neural representations and symbolic operations (Liang et al., 2017), we propose a framework for neural data-to-text generation that is able to utilize information from pre-computed operations on raw data. Based on a standard sequence-to-sequence model with an attention and copying mechanism, we design a gating mechanism for the neural model to decide which part of the execution results should be used for generation. To address the second challenge, we also design a quantization layer to map numerical execution results into bins to guide different lexical choices according to different quantities of values.
To examine the effectiveness of our proposed model, we collect a large dataset of sports headline generation for NBA basketball games. We also evaluate the models on the ROTOWIRE dataset released by Wiseman et al. (2017), which targets the generation of short paragraphs. Experiments show that our model outperforms current state-of-the-art neural methods in terms of both fluency and fidelity. In summary, we make the following contributions in this paper:
• We propose a neural data-to-text framework that generates texts by additional processing over input data. Based on a basic sequence-to-sequence model with attention and copying, we design a gating mechanism to enable the model to decide which part of the executed results should be utilized. We also propose a novel quantization layer to map specific numerical values onto different spans to affect lexical choices under different conditions.
• To focus our study on correct text generation, we collect a challenging dataset for NBA headline generation.
• We conduct experiments on the NBA headline dataset as well as the ROTOWIRE dataset from previous work. Results show improvements on both correctness and fluency from our proposed framework over baseline systems.
2 Background: Attention-Based Neural Sequence-to-Sequence Model
In this section, we briefly introduce the architecture of the attention-based sequence-to-sequence (Seq2Seq) model (Cho et al., 2014b; Bahdanau et al., 2015) with a copy mechanism (See et al., 2017), which is the basis of our proposed model.

RNN Encoder-Decoder
The goal of data-to-text generation is to generate a natural language description for a given set of data records $S = \{r_j\}_{j=1}^{K}$. Usually, a Seq2Seq model consists of an encoder and a decoder with recurrent neural networks (RNN). First, each input record $r_j$ is encoded into a hidden vector $h_j$ with $j \in \{1, ..., K\}$ using a bidirectional RNN. The decoder generates the description word by word using another RNN.
In the training phase, given a record set and its corresponding natural language description $(S, y)$, the Seq2Seq model maximizes the conditional probability:

$$P(y \mid S) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, S) \quad (1)$$

where $y_t$ is the $t$-th word in the description and $T$ is the length of the description. The conditional probability $P(y_t \mid y_{<t}, S)$ is computed by a non-linear function $f(\cdot)$ (Eq. 2) whose inputs include the decoder hidden state $d_t$ at step $t$ and the context vector $c_t$; the hidden state $d_t$ is in turn produced by another non-linear function $g(\cdot)$ (Eq. 3). We adopt the Gated Recurrent Unit (GRU) (Cho et al., 2014a) as the recurrent unit for the encoder and decoder. The context vector $c_t$ in Eq. 2 is computed as a weighted sum of the hidden vectors $h_j$:

$$c_t = \sum_{j=1}^{K} \alpha_{t,j} h_j \quad (4)$$

where $\alpha_{t,j}$ is computed by an attention scheme, typically implemented as a softmax distribution over scores calculated with a multi-layer perceptron (Bahdanau et al., 2015).
Figure 1: A diagram of the operation-guided neural data-to-text generation. The input record table is converted from the first 3 columns of Table 1. First, a set of operations is applied to the input records. Then, the records, operations and pre-executed operation results are encoded. Finally, an attention-equipped GRU decoder with a gating mechanism decides which part of the execution results and context should be used for generation.
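The weighted sum of encoder hidden vectors described above can be illustrated with a minimal sketch. The scores here are hypothetical fixed numbers; in the model they would come from a learned multi-layer perceptron, which this sketch does not implement.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(scores, hidden):
    """Weighted sum of hidden vectors h_j with weights alpha = softmax(scores)."""
    alpha = softmax(scores)
    dim = len(hidden[0])
    return [sum(a * h[d] for a, h in zip(alpha, hidden)) for d in range(dim)]

# Toy example: three encoder states of dimension 2 (hypothetical values).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 0.0]  # equal scores give uniform attention weights
c_t = context_vector(scores, hidden)
```

With equal scores, every record receives weight 1/3, so the context vector is simply the mean of the hidden vectors.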

Copy Mechanism
Recent work augments Seq2Seq models to copy words directly from the source information on which they are conditioned (Gu et al., 2016; See et al., 2017). These models usually introduce an additional binary variable $z_t$ into the per-timestep target word distribution, which indicates whether the target word $y_t$ is copied from the source or is generated from the recurrent hidden states. We use the pointer-generator network (See et al., 2017) for the copy mechanism. Specifically, the probability $p_{gen}$ that the binary variable $z_t$ selects generation is calculated from the context vector $c_t$, the decoder state $d_t$ and the decoder input $y_{t-1}$:

$$p_{gen} = \sigma(w_c^{\top} c_t + w_d^{\top} d_t + w_y^{\top} y_{t-1} + b_{ptr}) \quad (5)$$

where the vectors $w_c$, $w_d$, $w_y$ and the scalar $b_{ptr}$ are learnable parameters, and $\sigma$ is the sigmoid function. The joint probability for generating $y_t$ is formulated as follows:

$$P_{copy}(y_t \mid y_{<t}, S) = p_{gen} P(y_t \mid y_{<t}, S) + (1 - p_{gen}) \sum_{j:\, r_j = y_t} \alpha_{t,j} \quad (6)$$
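As a rough illustration of how a pointer-generator mixes generation and copying, the sketch below combines a vocabulary distribution with attention weights over source tokens. The function name, the toy vocabulary and all probabilities are hypothetical; this follows the general pointer-generator idea of See et al. (2017) rather than the exact implementation used in this paper.

```python
def copy_distribution(p_gen, p_vocab, attention, source_tokens):
    """Mix a generation distribution with a copy distribution.

    p_gen: probability of generating from the vocabulary.
    p_vocab: dict mapping word -> generation probability.
    attention: attention weights over source tokens (sums to 1).
    source_tokens: the source tokens the attention weights refer to.
    """
    mixed = {w: p_gen * p for w, p in p_vocab.items()}
    for a, tok in zip(attention, source_tokens):
        # Copy mass is added to the token, even if it is out of vocabulary.
        mixed[tok] = mixed.get(tok, 0.0) + (1.0 - p_gen) * a
    return mixed

# Hypothetical toy values.
p_vocab = {"beat": 0.6, "edge": 0.3, "95": 0.1}
attention = [0.7, 0.2, 0.1]
source_tokens = ["Hawks", "95", "94"]
p = copy_distribution(0.8, p_vocab, attention, source_tokens)
```

The mixed distribution still sums to one, and source-only tokens such as "Hawks" receive probability purely through the copy term.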

The Proposed Model
In this paper, we propose to utilize information from pre-executed operations on the input data to guide the generation. As shown in Fig. 1, our model consists of a record encoder, an operation encoder, an operation result encoder, and an attention-equipped GRU decoder with a gating mechanism. First, a set of operations is applied to all valid records in the input data, yielding their corresponding pre-executed results. The pre-executed results act as facts inferred from the input data to guide the generation. Then, the records, operations and pre-executed operation results are encoded into their corresponding representations. Finally, we design a gating mechanism for the GRU decoder to decide which part of the inferred facts should be used for generation. Moreover, to address the challenge of establishing correspondences between specific numeric results and lexical choices, a quantization layer maps the results into several segments to guide the lexical choices.

Notation
We are given an input data and description pair $(S, y)$, where each target description $y = y_1, ..., y_T$ consists of $T$ words, and each input is stored in a table (e.g., Table 1) in which each row is an entity and each column is a field of this entity. The input data can be converted into $K$ records, where each record $r$ is a triple $(r.idx, r.f, r.v)$. For $r_4$ in the table of Fig. 1, $r.idx$, $r.f$ and $r.v$ refer to the row index (e.g., row 2), the field name (e.g., column Points) and the value (e.g., cell value 95), respectively. We also define a set of operations $\{op_i\}$, which are applied to the input records $S$ to produce corresponding results at the preprocessing stage. The results of operations can be categorized into two types: $op^{scl}_i$ denotes results that are scalar values and $op^{idx}_i$ denotes results that are indexing values.
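To make the record notation concrete, here is a small sketch that flattens a table into (idx, f, v) triples. The two-column table below is a hypothetical stand-in built from the scores in the running example; the actual columns of Table 1 are not reproduced here.

```python
from typing import Dict, List, NamedTuple

class Record(NamedTuple):
    idx: int  # 1-based row index (the entity)
    f: str    # field (column) name
    v: str    # cell value, kept as a string token

def table_to_records(rows: List[Dict[str, object]]) -> List[Record]:
    """Flatten a list of row dicts into (idx, f, v) triples, row by row."""
    records = []
    for i, row in enumerate(rows, start=1):
        for field, value in row.items():
            records.append(Record(i, field, str(value)))
    return records

# Hypothetical table: one row per team.
table = [
    {"Team": "Heat", "Points": 94},
    {"Team": "Hawks", "Points": 95},
]
records = table_to_records(table)
```

In this toy layout, the fourth record is the triple for row 2, column Points, value 95, which mirrors the $r_4$ example in the text.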

Encoding Records
We map each record $r \in S$ into a vector $\mathbf{r}$ by concatenating the embeddings of $r.idx$ (e.g., row 2), $r.f$ (e.g., column Points) and $r.v$ (e.g., cell value 95), denoted as $\mathbf{r} = [e^{idx}, e^{f}, e^{v}]$, where $e^{idx}$, $e^{f}$ and $e^{v}$ are trainable word embeddings of $r.idx$, $r.f$ and $r.v$ respectively, similar to Yang et al. (2017). We feed the set of record vectors $\mathbf{r}_1, ..., \mathbf{r}_K$ to a bidirectional GRU, which yields the final record representations $h^{ctx}_1, ..., h^{ctx}_K$ as introduced in Section 2. We leave the exploration of different encoding methods, which may affect performance, to future work.

Encoding Operations
As shown in Fig. 1, each operation $op_i$ consists of: a) the name of the operation $op_i.t$ (e.g., minus); b) the column $op_i.c$ to which the operation applies (e.g., Points); and c) the rows to which the operation applies, denoted as $op_i.arg$, where $A$ is the count of arguments. We then encode each operation $op_i$ by concatenating the representations of these three components and feeding them into a nonlinear layer, which yields the operation representation $h^{op}_i$. Here $op^{t}_i$ is the embedding of $op_i.t$, and $op^{c}_i$ is the embedding of column $op_i.c$, which shares its embedding parameters with the record column $r.f$. Since $op_i.arg$ may contain multiple arguments, we apply a nonlinear layer over the argument embeddings to obtain a fixed-length representation, where $e^{idx}_k$ is the same embedding as used to encode the row index $r.idx$, and $W^{arg}_k$ and $b^{arg}$ are learnable parameters. For operations which are applied to the entire column (e.g., argmax) without specific rows, the representation of the arguments is a special vector which stands for ALL.

Encoding Operation Results
The operations produce two types of results: scalar results (e.g., the minus operation returns -1) and indexing results (e.g., the argmax operation returns the row number 2). Two encoders are designed to encode these results respectively.
Scalar Results Representation In Table 1, the word "edges" is generated based on the fact that the points gap of the two teams is -1. In fact, other values such as -2 or -3 are close to -1, and the word "edges" is also applicable to them. However, directly establishing the lexical choices on various sparse numeric values is not easy (Reiter et al., 2005; Smiley et al., 2016; Zarrieß and Schlangen, 2016). Reiter et al. (2005) use consistent data-to-word rules for time-series weather forecast summary generation. In this paper, we aim to capture the data-to-word mapping automatically by a simple quantization unit. A quantization layer is designed to map the scalar values into several bins, namely quantization units. Specifically, we feed each scalar value $op^{scl}_i$ to a softmax layer, and its representation $h^{res}_i$ is computed as the weighted sum of all quantization embeddings:

$$q_i = W^{q} \, op^{scl}_i + b^{q} \quad (9)$$

$$\mu_{i,l} = \frac{\exp(q_{i,l})}{\sum_{l'=1}^{L} \exp(q_{i,l'})} \quad (10)$$

$$h^{res}_i = \sum_{l=1}^{L} \mu_{i,l} \, e^{scl}_l \quad (11)$$

where $W^{q}$ and $b^{q}$ are trainable parameters, $e^{scl}$ is the quantization embedding and $L$ is the number of quantization units. Note that $L$ is much smaller than the number of unique scalar results. We set $L$ to 5 in this paper.
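The quantization unit described above can be sketched in a few lines: a scalar is projected to L logits, a softmax turns them into soft bin weights, and the result representation is the weighted sum of bin embeddings. All parameter values below (`W_q`, `b_q`, `bin_embeddings`) are hypothetical hand-picked numbers, not learned weights from the paper.

```python
import math

def quantize(value, W_q, b_q, bin_embeddings):
    """Soft-assign a scalar to L bins and return (bin weights, representation)."""
    logits = [w * value + b for w, b in zip(W_q, b_q)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    mu = [e / z for e in exps]
    dim = len(bin_embeddings[0])
    h_res = [sum(mu[l] * bin_embeddings[l][d] for l in range(len(mu)))
             for d in range(dim)]
    return mu, h_res

# Hypothetical parameters for L = 5 bins and 2-dimensional bin embeddings.
W_q = [-1.0, -0.5, 0.0, 0.5, 1.0]
b_q = [0.0, 0.0, 0.0, 0.0, 0.0]
bin_embeddings = [[float(l), 1.0] for l in range(5)]

mu_small, h_small = quantize(-1.0, W_q, b_q, bin_embeddings)
mu_large, h_large = quantize(20.0, W_q, b_q, bin_embeddings)
```

With these toy parameters, a gap of -1 spreads its weight over several bins, while a gap of 20 puts almost all weight on a single bin, so nearby values share a representation while very different values do not.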
Indexing Results Representation Some operations produce the row number of a record (denoted as $idx_i$) as a result. For instance, the argmax operation in Fig. 1 returns row 2. We then look up the row embedding of the selected record, as defined in Section 3.2, to represent the result; that is, $h^{res}_i = e^{idx}_i$, the row embedding of $idx_i$.

Decoder
Compared with the Seq2Seq model described in Section 2, the main difference of our model lies in the context vector $c_t$. Different from Eq. 4, our model has both records and operations as input. We design two attention layers to summarize information from the two parts respectively. The overall context vector $c_t$ combines the context vector of operation results $c^{op}_t$ and the context vector of records $c^{ctx}_t$, balanced by a dynamic gate $\lambda_t$ (Eq. 12).
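A minimal sketch of such a gate is given below. It assumes a convex combination in which the gate weights the record context and its complement weights the operation context, and it assumes the gate is a sigmoid of a linear function of the previous decoder state; both choices, along with the names `w_g` and `b_g`, are illustrative assumptions rather than details confirmed by the text.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_value(d_prev, w_g, b_g):
    """Hypothetical gate: sigmoid of a linear function of the decoder state."""
    return sigmoid(sum(w * d for w, d in zip(w_g, d_prev)) + b_g)

def gated_context(c_op, c_ctx, lam):
    """Assumed form: (1 - lam) * operation context + lam * record context."""
    return [(1.0 - lam) * o + lam * c for o, c in zip(c_op, c_ctx)]

# Toy values: a zero decoder state gives a gate of exactly 0.5.
c_op = [1.0, 1.0]
c_ctx = [3.0, 3.0]
lam = gate_value([0.0, 0.0], [0.5, -0.5], 0.0)
c_t = gated_context(c_op, c_ctx, lam)
```

Fixing the gate at 0.5 gives a plain average of the two context vectors, which corresponds to the "w/o gate" ablation discussed in the analysis.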
As there are two types of operation results which have quite different meanings, their context vectors are calculated separately and then combined by a nonlinear layer. The context vector $c^{scl}_t$ of operation results with scalar values at timestep $t$ is constructed as (Luong et al., 2015):

$$c^{scl}_t = \sum_{j} \alpha^{scl}_{t,j} \, h^{res}_j \quad (14)$$

$$\alpha^{scl}_{t,j} = \frac{\exp(\mathrm{MLP}(d_{t-1}, h^{op}_j))}{\sum_{j'} \exp(\mathrm{MLP}(d_{t-1}, h^{op}_{j'}))} \quad (15)$$

where MLP stands for a standard 1-layer perceptron (with tanh nonlinearity), and $\alpha^{scl}_{t,j}$ refers to the importance of the $j$-th operation at the current timestep $t$. Eq. 14 is based on the attention mechanism, which can be treated as mapping a query and a set of key-value pairs to an output. The output $c^{scl}_t$ is computed as a weighted sum of the values $h^{res}_j$, where the weight assigned to each value is computed by a compatibility function of the query $d_{t-1}$ with the corresponding key $h^{op}_j$. In the same way, we construct $c^{idx}_t$. The context vector of operation results at timestep $t$ is then computed by combining these two context vectors through a nonlinear layer. The context vector $c^{ctx}_t$ for records is constructed by replacing $h^{res}_j$ with $h^{ctx}_j$ in Eq. 14 and replacing $h^{op}_j$ with $h^{ctx}_j$ in Eq. 15. After obtaining $c_t$, the word distribution for generation can be calculated by substituting it for the $c_t$ in Eq. 2. For the copy probability defined in Eq. 6, to copy words based on the information of both operations and records at the current timestep $t$, we need to update the attention weights for Eq. 6 based on the newly computed context vector $c_t$ and decoding state $d_t$.
Table 2: Dataset statistics. For each dataset, we also manually label the source for the facts mentioned in 20 descriptions, and report the percentage of facts based on the input data, inferred facts and unsupported facts.
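The query, key and value view of attention over operation results can be sketched as follows. Keys play the role of operation representations and values the role of result representations. The scoring function `tanh_score` is a hypothetical parameter-free stand-in for the 1-layer tanh perceptron, and all vectors are toy values.

```python
import math

def kv_attention(query, keys, values, score):
    """Attend over values using weights from score(query, key)."""
    scores = [score(query, k) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alpha = [e / z for e in exps]
    dim = len(values[0])
    context = [sum(a * v[d] for a, v in zip(alpha, values)) for d in range(dim)]
    return alpha, context

def tanh_score(query, key):
    # Hypothetical compatibility function without learned weights.
    return sum(math.tanh(q + k) for q, k in zip(query, key))

# Toy example: two operations with keys (operation encodings)
# and values (result encodings) of dimension 2.
keys = [[2.0, 2.0], [-2.0, -2.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
alpha, c_scl = kv_attention([0.0, 0.0], keys, values, tanh_score)
```

Separating keys from values lets the decoder decide which operation is relevant based on what the operation is, while the information passed on is what the operation returned.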

Training
As the results of operations are pre-computed in an offline stage, our proposed model is fully differentiable and can be optimized in an end-to-end manner using back propagation. Given the batches of records $\{S\}_N$ and the standard natural language descriptions $\{Y\}_N$, the objective function is to minimize the negative log-likelihood:

$$\mathcal{L} = -\sum_{k=1}^{N} \sum_{t=1}^{T_k} \log P(y^{k}_t \mid y^{k}_{<t}, S^{k})$$

where the superscript $k$ indicates the index of the records-description pair, and $T_k$ is the length of the $k$-th description.
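For illustration, the negative log-likelihood over a batch can be computed from the per-token probabilities that a model assigns to the reference words. The probabilities below are made-up numbers, and the sketch sums over the batch without any averaging, which is an assumption about normalization.

```python
import math

def negative_log_likelihood(batch_token_probs):
    """Sum of -log p over every reference token of every description.

    batch_token_probs: list of descriptions, each a list of the model's
    probabilities for the reference tokens y_1..y_T of that description.
    """
    return -sum(math.log(p) for probs in batch_token_probs for p in probs)

# Two hypothetical descriptions of lengths 2 and 1.
loss = negative_log_likelihood([[0.5, 0.5], [0.25]])
```

Here the loss equals log 16, since the three token probabilities multiply to 1/16.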

Datasets
Several benchmark datasets have been used in recent years for data-to-text generation (Liang et al., 2009; Chen and Mooney, 2008; Lebret et al., 2016). For instance, Lebret et al. (2016) have built a biography generation dataset from Wikipedia. However, a recent study by Perez-Beltrachini and Gardent (2017) shows that existing datasets lack some desirable properties, such as syntactic and semantic diversity. To check whether the facts mentioned in the descriptions are based on input data, we identify the text spans which contain facts (e.g., in Table 1, "Hawks" is a span containing a fact) from the descriptions and assign each span to one of three categories: a) input facts (facts that can be directly found in the input), b) inferred facts (facts that cannot be directly found in the input but can be derived from it), c) unsupported facts (facts that can neither be found in nor derived from the input data). Table 2 shows that the WikiBio dataset requires inference on only 5.4% of its data. To better demonstrate the effectiveness of our approach, we adopt the following datasets, which require substantially more inference based on the input data:
ROTOWIRE We use the dataset and its standard splits released by Wiseman et al. (2017), which consists of 4,853 human-written NBA basketball game summaries aligned with their corresponding game statistics. Table 2 shows that 11.7% of facts in the game summaries can be inferred based on the input data. However, this dataset focuses on generating long text and 27.1% of facts are unsupported (footnote 2), which makes it difficult to analyze the fidelity of the generated text.
ESPN We collect 15,054 NBA game result headlines from 2006 to 2017 from the ESPN website, paired with their corresponding game statistics. These headlines are professional and concise, e.g., the description in Fig. 1. The percentage of inferred facts is 29.1% while that of unsupported facts is only 8%, so we can focus on generation for the inferred facts. We split the dataset into 12,043 (80%) for training, 1,505 (10%) for development and 1,506 (10%) for testing.

Instantiation
In the following experiments, we define two operations: the minus operation, which returns a scalar result, and the argmax operation, which returns the index of a row. These operations are applied to all columns and rows whose record values are numeric. The number of pre-executed results increases with the number of operations, the number of arguments and the size of the input data, which will impact the efficiency of our model. Unnecessary operation arguments can be pruned, e.g., by only applying operations to the arguments co-mentioned in descriptions in the training set. We leave this part of the research for future work.
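The pre-execution step can be sketched as below: minus is applied to pairs of rows and argmax to whole columns, for every numeric column. Applying minus to every ordered pair of rows, the tuple layout of the results, and the toy table are all assumptions made for this illustration.

```python
from itertools import permutations

def is_number(x):
    return isinstance(x, (int, float)) and not isinstance(x, bool)

def pre_execute(rows):
    """Return (operation, column, arguments, result) tuples for numeric columns."""
    results = []
    numeric_fields = [f for f in rows[0] if all(is_number(r[f]) for r in rows)]
    for f in numeric_fields:
        # minus: scalar result for each ordered pair of rows (1-based indices).
        for i, j in permutations(range(len(rows)), 2):
            results.append(("minus", f, (i + 1, j + 1), rows[i][f] - rows[j][f]))
        # argmax: indexing result over the entire column.
        best = max(range(len(rows)), key=lambda i: rows[i][f])
        results.append(("argmax", f, "ALL", best + 1))
    return results

# Hypothetical two-team table based on the running example.
table = [
    {"Team": "Heat", "Points": 94},
    {"Team": "Hawks", "Points": 95},
]
results = pre_execute(table)
```

Even this tiny table yields three results for a single numeric column, which shows how quickly the number of pre-executed results grows with more rows and columns.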

Experiment Setup
In the main experiments, we compare our model with the following methods: (a) Template: a problem-specific template-based generator which fills structured data into corresponding placeholders to generate texts; (b) Seq2Seq+copy: the Seq2Seq model with the pointer network copy mechanism introduced in Section 2, which is one of the state-of-the-art methods; (c) Seq2Seq+op: Seq2Seq+copy plus the results of operations, where the results are directly treated as extra records and fed to the record encoder introduced in Section 3.2 together with the original input; (d) Seq2Seq+op+quant: Seq2Seq+op with the quantization layer (Eq. 9-11) applied to the results of the minus operation. For completeness, we also report the results of Wiseman et al. (2017) on the ROTOWIRE dataset. The difference between this baseline and Seq2Seq+copy is that the former uses an LSTM rather than a GRU for decoding and an additional copying loss. All the experiments use a beam size of 5 in decoding.
Footnote 2: e.g., injuries, rankings in the league, team schedule, etc.
For model training, we use stochastic gradient descent with the AdaDelta optimizer (Zeiler, 2012). The dimensions of trainable word embeddings are set to 256, except for the input record row embedding, whose dimension is set to 32; the dimensions of hidden units in the GRUs are all set to 512. All parameters are initialized using a normal distribution with zero mean and a variance of $6/(d_{in} + d_{out})$, where $d_{in}$ is the dimension of the input layer and $d_{out}$ is the dimension of the output layer (Glorot and Bengio, 2010). Training converges after 40 epochs.
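As a small illustration of the initialization scheme as stated above, the sketch below draws a weight matrix from a zero-mean normal distribution with variance 6/(d_in + d_out). The helper name, the seed and the 64 x 64 shape are hypothetical choices for the example.

```python
import math
import random

def init_matrix(d_in, d_out, seed=0):
    """Draw a d_out x d_in matrix from N(0, 6 / (d_in + d_out))."""
    rng = random.Random(seed)
    std = math.sqrt(6.0 / (d_in + d_out))  # gauss expects a standard deviation
    return [[rng.gauss(0.0, std) for _ in range(d_in)] for _ in range(d_out)]

W = init_matrix(64, 64)
flat = [w for row in W for w in row]
mean = sum(flat) / len(flat)
sample_var = sum((w - mean) ** 2 for w in flat) / len(flat)
```

The sample variance of the drawn weights should land close to the target value of 6/128 for this shape.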

Main Results
We adopt both automatic evaluation and human evaluation to evaluate the proposed model.
Automatic Evaluation We employ BLEU-4 as the metric for automatic evaluation. Table 4 gives the automatic evaluation results for generation on the two datasets. Our proposed model OpAtt outperforms the neural network baselines (See et al., 2017; Wiseman et al., 2017). The results show that our method, which incorporates the operations, enables generating texts that are faithful to the facts and therefore yields the best performance. Seq2Seq+op+quant outperforms the baseline method Seq2Seq+copy, but is not as good as our method. This result confirms that our proposed method, with its specialized operation encoder and gating mechanism, utilizes the information of operations more effectively. Moreover, Seq2Seq+op+quant outperforms Seq2Seq+op, showing the effectiveness of the quantization layer.
Human Evaluation Because of the approximate nature of the automated metric BLEU, we also conduct human evaluation to examine the fidelity of the generated texts. We randomly select some games from the testing set, and entrust a professional crowdsourcing company to annotate the generated texts (footnote 5). Specifically, three native English-speaking workers who are familiar with NBA games are hired. They are first required to identify the text spans which contain facts in the generated texts, then categorize each text span into one of the three fact types listed in Table 2, and finally judge whether the span is supported or contradicted by the input data. Table 3 shows the annotation results. Our method talks more about the inferred facts in the generated texts while including fewer contradictions. In addition, all methods produce some unsupported facts, which affect the fidelity of the generated texts. We leave this issue for future work.
Footnote 5: The Fleiss' kappa score of the annotation is 0.782 for ESPN and 0.761 for ROTOWIRE. For the ESPN dataset, we select 50 games, each with one generated sentence. For ROTOWIRE, following Wiseman et al. (2017), we select 15 games, each with 3 randomly selected sentences.

Analysis
As discussed in Section 4.1, the ESPN dataset is rich in inferred facts. Therefore, the model analysis is based on this dataset, and all case studies are made on the development set.

Effect of Operations
We examine the necessity and the benefit of introducing operations by removing the argmax operation (see "OpAtt w/o argmax op" in Table 5). Compared to Seq2Seq+copy, the results show that our full model and "OpAtt w/o argmax op", which both incorporate results of operations, work well in terms of BLEU, and the improvements increase with the number of operations.
To better illustrate that our model can generate factually correct text, we show the texts generated by different models in Table 6. The game results mentioned in the text generated by the Seq2Seq+copy model are wrong, which shows the inability of existing neural models to infer facts from the structured data. After adding the minus operation, "OpAtt w/o argmax op" is able to infer the game result by applying the minus operation to the points of the two competing teams, and therefore its generated text conforms to the game results. The results confirm the necessity of introducing operations to ensure factually correct generation. Furthermore, our full model generates text with the correct point leader and game result based on the results of the argmax operation and the minus operation respectively.
The quantization layer maps the numerical execution results into several bins to enable different lexical choices according to different quantities of values. Compared to our full model, "OpAtt w/o quantization" in Table 5, which removes the quantization layer, has a lower BLEU score, which shows the effectiveness of the quantization layer for lexical choices during generation.

Effect of Quantization
In Fig. 2, we visualize the weights of the quantization softmax layer $\mu_{i,l}$ produced by Eq. 10 when mapping the points gap of two competing teams to five bins. We can see that points gaps with close numerical values are mapped to the same bin, so the decoder can choose similar words for them in generation. When the absolute value of the points gap is small, the weight distribution is dispersed, and the decoder tends to generate general words. The distribution becomes more concentrated as the absolute value of the points gap increases, which leads to more specific words. Moreover, we show the distribution of words that describe the winning relationship of games over different intervals of the points gap. As shown in Table 7, apart from the three most common words "beat", "past" and "win over", our proposed quantization layer can choose specific words according to the points gap. The words "edge" and "hold off" are only chosen when the points gap is small, while the words "rout" and "blow out" appear when the points gap is larger than 10.

Table 7:
Points Gap | Words describing the winning relationship
[0, 5) | beat, past, win over, edge, hold off, survive
[5, 10) | beat, past, win over, out last, hold off
[10, 20) | beat, past, win over, blow out, top, pull away, rout
>= 20 | beat, past, win over, power, rout, easy win over, roll past

We design a gating mechanism to decide when to incorporate the results of operations to guide the process of generation. In Table 5, "OpAtt w/o gate" stands for the method which replaces the balancing weight $\lambda$ in Eq. 12 with the constant 0.5, which is a special case of our proposed gating mechanism. The performance of this ablation is worse than that of our full model, which demonstrates that the gating mechanism is an essential component. Fig. 3 shows an example of the gating weights at each timestep in generation, where a darker cell means that more information from operation results is incorporated for decoding the corresponding word. We can see that the gate weights are reasonable, as the gate values are large when deciding the team leader "horford" and the winner of the game "hawks".

Related Work
Recent work avoids the distinction between content selection and sentence realization. Chen and Mooney (2008) use an SMT-based approach to learn alignments between comments and their corresponding event records. Angeli et al. (2010) transform the problem into a sequence of local decisions using a log-linear model. Konstas and Lapata (2012) employ a PCFG to simultaneously optimize the content selection and surface realization problems.
In the field of neural text generation, Mei et al. (2016) use a neural encoder-decoder approach for end-to-end training. Some work has focused on conditional language generation based on tables (Yang et al., 2017), short biography generation from Wikipedia tables (Lebret et al., 2016; Chisholm et al., 2017) and comment generation based on stock prices (Murakami et al., 2017). However, none of these methods consider incorporating the facts that can be inferred from the input data to guide the process of generation. Murakami et al. (2017) post-process the price by extending the copy mechanism and replacing numerical values with defined arithmetic operations after generation. In contrast, our model, OpAtt, utilizes information from pre-computed operations on raw data to guide the generation.
Our work is related to research on deep learning models for program induction and question answering from a knowledge base (Neelakantan et al., 2016; Liang et al., 2017). Neelakantan et al. (2016) solve the problem of semantic parsing from structured data and generate programs using pre-defined arithmetic operations. Liang et al. (2017) design a set of executable operators and obtain the answers from the generated logic forms. Sets of operators have also been designed to generate latent programs for math problem solving. However, data-to-text is a different task. The operations in these methods are designed to find the answers, while we use the operations to guide the process of generation.

Conclusion and Future Work
In this work, we address the problem of generating consistent text from structured data in a neural data-to-text generation framework, where we extract facts that can be inferred from the given data by applying several executable symbolic operations to guide the generation. Moreover, we design a special quantization layer for operations whose result type is a numeric value and establish the correspondence between the numeric values and lexical choices in generation. Experiments show that our method, OpAtt, outperforms existing state-of-the-art neural methods in both fluency and fidelity evaluations.
As applying operations on a large number of records greatly increases the search space for the attention mechanism, we will extend our model to automatically detect the relevant operations to reduce computing complexity. We will also extend the set of operations to accommodate historical data, graph data and detect the unsupported facts in the generation within the single framework.