Logic2Text: High-Fidelity Natural Language Generation from Logical Forms

Previous studies on Natural Language Generation (NLG) from structured data have primarily focused on surface-level descriptions of record sequences. However, for complex structured data, e.g., multi-row tables, it is often desirable for an NLG system to describe interesting facts derived from logical inferences across records. Given only the table, it is hard for existing models to produce controllable and high-fidelity logical generations. In this work, we formulate high-fidelity NLG as generation from logical forms in order to obtain controllable and faithful generations. We present a new large-scale dataset, Logic2Text, with 10,753 descriptions involving common logic types paired with the underlying logical forms. The logical forms show diversified graph structures with free schema, which poses great challenges to a model's ability to understand their semantics. We experiment with (1) fully-supervised training on the full dataset, and (2) a few-shot setting with only hundreds of paired examples; we compare several popular generation models and analyze their performance. We hope our dataset can encourage research towards building an advanced NLG system capable of natural, faithful, and human-like generation. The dataset and code are available at https://github.com/czyssrs/Logic2Text.


Introduction
Natural language generation (NLG) from structured data has been an important research problem in many applications. Recent data-driven methods have achieved good performance on various NLG tasks (Liu et al., 2018; Freitag and Roy, 2018; Chen et al., 2019b). However, most studies focus on surface descriptions of simple record sequences, for example, attribute-value pairs with a fixed or very limited schema, as in E2E (Novikova et al., 2017) and WikiBio (Lebret et al., 2016). In real-world cases with multi-row tables, it is often more desirable to provide descriptions involving higher-level logical inference across data records. For example, in Figure 1, instead of plain restatements, human readers prefer abstract descriptions that summarize or draw conclusions over the table records. To produce such logical-level generations with high fidelity, it is not yet appropriate to provide only the table as the input to a real-world NLG system, for the following reasons: 1) Low fidelity. Given only the table, it is challenging for existing neural models to produce logically correct generations involving reasoning and symbolic calculations, e.g., max, min, counting, averaging, etc.
2) Uncontrollable content selection. Given a table, the space of logically entailed descriptions is exponentially large, due to the vast number of combinations of different operations and arguments over the table, e.g., count, comparison, superlative, etc. It is hard for neural models to make a valid and desirable choice of logical selections based solely on the table, due to the difficulty of imposing high-level semantic constraints in the compositional generation process.
To address the above problems, we argue that it is necessary to leverage intermediate meaning representations to achieve faithful and controllable logical generations. To this end, we formulate the task of logical-level NLG as a logical-form-to-text problem. Specifically, besides the table information, the generation module is provided with a logical form representing the semantics of the target text (see Figure 1 for an example). By separating logical reasoning from language realization, the correctness of the intermediate logical form is guaranteed, and the challenge for the realization module shifts entirely to semantic understanding.
To facilitate research in this direction, we propose a new dataset named LOGIC2TEXT, consisting of 5.6k open-domain tables and 10.8k manually annotated (logical form, description) pairs. Our dataset is of high quality in terms of (1) natural and interesting descriptions and (2) accurate logical forms with 100% execution correctness. In our dataset, the coarse logic types are 7 common ones used to describe multi-row tables: count, superlative, comparative, aggregation, majority, unique, and ordinal. We employ a Python-like program to serve as our logical forms, which can easily be converted to other types of logical forms. Figure 1 shows two examples from our dataset. Compared with previous surface-level NLG datasets, one major distinction of our dataset is the free schema of the logical forms, which can be represented as diversified graph structures. The new dataset poses great challenges to a model's ability to understand the structural semantics in graph representations.
We employ an array of popular generation models as the baseline approaches. The experiments are conducted in (1) Fully-supervised setting. We train the models using the full dataset to analyze their performance. (2) Few-shot setting. We simulate the low-resource scenario of real-world use cases. Experimental results show that the logical forms are critical for acquiring high-fidelity generations. The pre-trained language model outperforms the other baselines (pointer-generator, graph2seq, transformer, etc.), but still makes factual and logical errors.
In summary, our contributions are the following: • We propose a new large-scale dataset, LOGIC2TEXT, with descriptions of common logic types accompanied by the underlying logical forms. The logical forms present diversified graph structures, which raises new challenges for semantic understanding.
• We survey several popular generation models as baselines under fully-supervised and few-shot settings, and analyze their pros and cons.
Our dataset can also be used in the reverse direction (text to logical form) to facilitate tasks related to semantic parsing. Chen et al. (2019a) propose the task of fact verification against tables; however, performance is greatly limited by the lack of ground-truth logical forms. This can be one direct application of our dataset. In this work, we focus on NLG.
Related Work

NLG from structured data or knowledge has been studied for many years. There are various applications, such as the automatic generation of weather reports (Liang et al., 2009), sports reports (Wiseman et al., 2017), clinical and health reports (DiMarco et al., 2007; Lee, 2018), response generation in task-oriented dialogue systems (Wen et al., 2015; Budzianowski et al., 2018; Dušek et al., 2019), etc. Traditional methods typically employ a pipeline-based approach including content selection, planning, and surface realization (Reiter and Dale, 1997; Gatt and Krahmer, 2018). Recent data-driven methods tend to conflate the pipeline modules into one end-to-end neural network (Liu et al., 2018; Wiseman et al., 2017, 2018; Gong et al., 2019). Most recently, large-scale pre-trained models (Radford et al., 2019; Song et al., 2019; Raffel et al., 2019) have achieved new state-of-the-art results on various generation tasks. Chen et al. (2019b) demonstrate that a simple pre-training based method can achieve very reasonable performance on the WikiBio dataset (Lebret et al., 2016) under a few-shot setting. More recent works begin to focus on fidelity preservation in generation (Dhingra et al., 2019; Tian et al., 2019), obtaining good performance on surface-level NLG. In contrast, our work focuses on the fidelity of logical-level generations.
There are a few popular NLG datasets, mostly on surface-level generation, such as WeatherGov (Liang et al., 2009), E2E (Novikova et al., 2017), WikiBio (Lebret et al., 2016), and ToTTo (Parikh et al., 2020). RotoWire (Wiseman et al., 2017) is a more challenging dataset on generating basketball game reports from multi-row tables, but the reports are still limited to superficial restatements of table records, with very few involving logical inference. Korn et al. (2019) investigate the generation of interesting trivia from superlative Wikipedia tables. Chen et al. (2020) propose the task of generating arbitrary sentences with logical inference from a table. Their task mainly serves a probing purpose, i.e., testing the ability of neural models to produce any logically correct descriptions based solely on the table. However, such a task formulation is not yet appropriate for building a real-world NLG system due to low fidelity, as discussed in the introduction. The best-performing model in (Chen et al., 2020) only obtains a factual correctness rate of around 20% based on human evaluation, which is clearly far from an acceptable level for real-world systems.
Another line of work related to ours is text generation from syntactic or semantic sentence structure, such as generation from CCG grammar (White, 2006), UCG grammar (Gardent and Plainfossé, 1990), and AMR (Song et al., 2018). Many early works attempted algorithmic approaches on such logical formulations (Phillips, 1993; Calder et al., 1989; Shieber et al., 1990). Later proposed datasets include the Groningen Meaning Bank (Bos, 2013), the AMR bank (May, 2016), and DeepBank (Flickinger et al., 2012). In contrast, our work focuses on logical formulations executed on database-style tables with common symbolic operations, such as count, superlative, and comparison. As much of today's production data is stored in table-based databases, we believe such a dataset should help in building systems over table-based data.

Dataset Construction
We follow (Chen et al., 2019a) to filter out over-complicated tables and take a subset of tables with fewer than 20 rows and 10 columns.
In this dataset, we start from the 7 most commonly used logic types (Chen et al., 2019a) for describing multi-row tables: count, superlative, comparative, aggregation, majority, unique, and ordinal. For example, for the logic type count, the definition is: counting some rows in the table based on the values in one column, with the scope being all table rows or a subset. Refer to Appendix A for the definitions of all logic types. Each description involves exactly one type of logic. This matches the observation that humans generally do not describe the information they are interested in with over-complicated logic. For the logical forms, we use a Python-like program whose function set extends that of (Chen et al., 2019a). Refer to Appendix B for the definitions of all functions.
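As a hedged illustration of the count definition above, a logical form can be read as a nested Python expression evaluated against the table. The function names below (filter_eq, count, eq) are illustrative stand-ins; the dataset's exact function inventory is defined in Appendix B.

```python
def filter_eq(rows, column, value):
    """Select the scope: rows whose cell in `column` equals `value`."""
    return [r for r in rows if r[column] == value]

def count(rows):
    """Count the rows selected by the inner sub-expression."""
    return len(rows)

def eq(a, b):
    """Top-level check: a correct logical form should evaluate to True."""
    return a == b

# Toy table: each row is a dict keyed by column name.
all_rows = [
    {"nation": "canada", "medals": 2},
    {"nation": "usa", "medals": 3},
    {"nation": "canada", "medals": 1},
]

# "Two of the rows are for canada", as a nested program:
result = eq(count(filter_eq(all_rows, "nation", "canada")), 2)
```

Reading the expression inside-out mirrors how the logic type is defined: select a scope, apply the counting operation, then state the claimed value.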
Our dataset is constructed in 3 stages: §3.1 description composition and verification; §3.2 logical form annotation and derivation; §3.3 logical form execution and verification. We adopt the workflow of composing descriptions first and then deriving the logical forms because, under this order, the annotators can compose natural descriptions based on interesting facts in the table, which is hard to achieve by automatically enumerating logical forms and then rewriting templates. For all crowd-sourcing tasks we hire Amazon Mechanical Turk (AMT) workers under three requirements: (1) from English-speaking countries ("US", "CA", "GB", "AU"); (2) an approval rate higher than 95% over all HITs; (3) more than 500 approved HITs. We follow human subject research protocols (see https://en.wikipedia.org/wiki/Minimum_wage_in_the_United_States) to pay the workers. We maintain strict criteria for approval and review at least 10 random samples per worker to decide whether to approve or reject all of his/her HITs.

Description Composition & Verification
In this first stage, the human workers are asked to compose statements of a certain logic type that describe interesting facts in the table. It is possible that some logic types cannot be applied to certain tables. Therefore we design the following procedure: for each table, the 7 logic types are randomly put into three groups (of sizes 2, 2, and 3). The worker is asked to choose one logic type from each group and compose a description based on the chosen logic type. They must follow these requirements: (1) try to choose diversified logic types; (2) avoid template-like language and try to compose natural and interesting descriptions; (3) include the information in the table caption, so as to compose comprehensive descriptions without unspecified pronouns. An example of the workflow is shown in Figure 2. We provide the workers with detailed explanations of each logic type via their definitions, accompanied by examples. After collecting the descriptions, we add a verification stage to filter out low-quality descriptions. We redistribute the collected descriptions grouped by logic type, then ask three questions: Is this description (1) of the correct logic type, (2) factually correct, and (3) grammatically correct and fluent? We filter out the description if any question receives a negative response.
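The random grouping of the 7 logic types can be sketched as follows; the seeding and exact partition mechanics are assumptions, since the paper only specifies the group sizes.

```python
import random

LOGIC_TYPES = ["count", "superlative", "comparative", "aggregation",
               "majority", "unique", "ordinal"]

def group_logic_types(seed=None):
    """Randomly partition the 7 logic types into groups of sizes 2, 2, and 3;
    each worker then picks one type from each group to describe."""
    rng = random.Random(seed)
    types = LOGIC_TYPES[:]          # copy so the module-level list is untouched
    rng.shuffle(types)
    return types[:2], types[2:4], types[4:]
```

This guarantees each worker sees every logic type exactly once across the three groups, encouraging coverage of less common types.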

Logical Form Annotation & Derivation
As the core step of our dataset construction pipeline, we design a workflow to obtain the semantic information via conversations with human workers, then use that information to derive the logical forms. The questions in the conversation are specifically designed for each logic type. Here we go through the example of the logic type superlative given in Figure 3 to illustrate our annotation process.
The logical form structure prototype is shown in the grey part on the right, consisting of the description of the superlative value and the other mentioned columns of the row holding that value. We then ask follow-up questions to derive the complete logical form from the prototype, shown on the left of Figure 3: Q1. What is the scope of the superlative operation? (If the scope is a subset of all table rows, we perform another round of conversation to annotate the scope.) Q2. Which table column is the superlative operation over? Q3. What is the specific type of the superlative operation: maximum or minimum? Q4. Which table row holds the superlative value? Q5. Is the superlative value itself mentioned in the description? Q6. What other columns are mentioned in the description? After collecting the answers to these questions, we can derive the logical form, as shown in the middle part of Figure 3.
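The derivation step can be sketched as assembling a nested structure from the Q1-Q6 answers. This is a minimal sketch under assumed field and function names (argmax/max/hop mirror the style of the function set in Appendix B, but the exact derivation code is not given in the paper).

```python
def derive_superlative_form(answers):
    """Assemble a superlative logical form (as a nested dict) from the
    annotators' answers to Q1-Q6. All field names are illustrative."""
    op = "max" if answers["q3_kind"] == "maximum" else "min"        # Q3
    # Q1 scope + Q2 column identify the row holding the superlative value (Q4).
    target_row = {"arg" + op: [answers["q1_scope"], answers["q2_column"]]}
    claims = []
    if answers["q5_value_mentioned"]:                               # Q5
        claims.append({op: [answers["q1_scope"], answers["q2_column"]]})
    for col in answers["q6_other_columns"]:                         # Q6
        claims.append({"hop": [target_row, col]})                   # read a cell
    return {"and": claims} if len(claims) > 1 else claims[0]
```

The returned structure is the graph-shaped logical form the paper describes: function nodes (argmax, hop, and) with text nodes (column names, scope) as leaves.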
We provide the workers with detailed explanations of the prototype for each logic type, as well as several examples. Note that the prototype covers most, but not all, of the logical descriptions due to their diverse nature. Thus we also provide the option to skip an example if it cannot be formulated with the given question set. See Appendix A for the annotation process of the other logic types.

Logical Form Execution & Verification
After collecting the logical forms, we use the Stanford CoreNLP toolkit to tokenize all text content (all table information, the descriptions, and the text in the logical forms). To remove incorrect logical forms, we execute them and perform another round of semantic verification. Logical Form Execution The functionality of our logical forms is based on that of (Chen et al., 2019a). We extend the function set to deal with semi-structured table cells (dates, mixed numbers and strings, etc.). We execute all logical forms against the corresponding table and keep only those that evaluate to True. This guarantees that the logical forms in our dataset achieve 100% execution correctness.
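The execution filter can be sketched as a recursive evaluator over the nested form, keeping only forms that return True. The dict-based encoding and the small set of functions shown (argmax, hop, eq) are illustrative assumptions, not the dataset's full executor.

```python
def execute(node, table):
    """Recursively evaluate a logical-form node against a table (a list of
    dict rows). A dict {function: [args]} is a function node; the leaf
    "all_rows" denotes the whole table; other leaves are literals."""
    if not isinstance(node, dict):
        return table if node == "all_rows" else node
    (fn, args), = node.items()
    a = [execute(arg, table) for arg in args]
    if fn == "argmax":              # the row with the largest value in a column
        rows, col = a
        return max(rows, key=lambda r: r[col])
    if fn == "hop":                 # read one cell of a row
        row, col = a
        return row[col]
    if fn == "eq":                  # top-level forms should evaluate to True
        return a[0] == a[1]
    raise ValueError(f"unknown function: {fn}")

def passes_execution_check(form, table):
    """Mirror the dataset filter: keep a form only if it executes to True."""
    return execute(form, table) is True
```

Any annotated form that raises an error or evaluates to anything other than True would be discarded, which is what yields the 100% execution correctness guarantee.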
Semantic Verification Note that execution correctness does not guarantee semantic correctness. Therefore we perform another round of semantic verification. Since AMT workers do not have the expert knowledge needed to understand the logical forms, we convert each logical form into a natural language interpretation based on the operations of each function. We then ask the workers to verify whether the interpretation correctly matches the meaning of the description, with neither insufficient nor redundant information. We remove the examples receiving negative responses.
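Converting a logical form into a natural language interpretation can be done by composing per-function templates bottom-up. The templates below are illustrative assumptions; the paper does not publish its exact interpretation rules.

```python
TEMPLATES = {
    # Per-function English templates; "{0}", "{1}", ... are rendered arguments.
    "filter_eq": "the {1} column equals {2}",
    "count": "the number of rows where {0}",
    "eq": "{0} is equal to {1}",
}

def interpret(node):
    """Render a logical form (nested {function: [args]} dicts) as an English
    interpretation by recursively composing the per-function templates."""
    if not isinstance(node, dict):
        return str(node)
    (fn, args), = node.items()
    return TEMPLATES[fn].format(*(interpret(a) for a in args))
```

Because the interpretation is generated mechanically from the form, a worker who confirms it against the description is effectively confirming the form's semantics without reading the form itself.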
Expert Evaluation To demonstrate the quality of our dataset, we employ two computer science graduate students to conduct evaluations. We randomly sample 200 examples of each logic type to verify semantic correctness. Each example is examined by both students, and the decision is made after discussion. The result shows that each logic type reaches a correctness rate of no less than 90%.

Dataset Statistics and Analysis
We follow a rough ratio of 8:1:1 to split our dataset into 8,566 examples for training, 1,095 for development, and 1,092 for testing. The train, dev, and test sets share no tables. We show the statistics of the dataset in Table 1 and the distribution of the 7 logic types in Figure 4. Each table has 1-3 descriptions with different logic types. Since the logical forms are graph-structured by nature, we analyze their complexity via the number of nodes in the graph, considering both the number of function nodes (count, max, etc.) and the number of all nodes (function nodes plus text nodes). As shown in Figure 5, the logical forms in LOGIC2TEXT have a minimum of 5 nodes and a maximum of over 14 nodes. Among the logic types, comparative has the largest number of nodes, because it involves the selection of, and operations over, two table rows. superlative, ordinal, and unique primarily focus on one table row, sometimes with the scope being a subset of all table rows, which makes their logical forms more complex. count, majority, and aggregation are summarization-based logic types over multiple table rows; they are the three relatively simpler ones in terms of logical form structure. Figure 6 gives the logical form structures of 3 example logic types.
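The two node counts above can be computed by a simple recursion over the nested form; this sketch assumes the same dict-based encoding used for illustration elsewhere in this paper's examples, where each dict key is a function node and each leaf is a text node.

```python
def count_nodes(form):
    """Return (function_nodes, all_nodes) for a nested logical form:
    every {function: [args]} dict contributes one function node, and
    every leaf (column name, value, scope marker) is one text node."""
    if not isinstance(form, dict):
        return 0, 1                      # a leaf is one text node
    (fn, args), = form.items()
    funcs, total = 1, 1                  # this function node
    for child in args:
        f, t = count_nodes(child)
        funcs += f
        total += t
    return funcs, total
```

For example, a count form with one scope filter has 3 function nodes (eq, count, filter_eq) plus 4 text-node leaves, i.e., 7 nodes in total.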

Experiments
In this section we first describe the baseline models for our dataset in §5.1. We then conduct experiments in the fully-supervised setting in §5.2, demonstrate the importance of the logical form in §5.3, and perform ablation studies in §5.4. Finally, we carry out experiments under the few-shot setting in §5.5.

Baseline Models
Apart from the logical form, which serves as the primary input to the generation model, the table information is also crucial for providing context. Following the order in which humans comprehend the table and produce descriptions, the input C is formulated as the sequence of the table caption, table headers, table content, and the logical form. The goal is to generate a sequence w that maximizes P(w | C). We employ the following models as our baselines for LOGIC2TEXT: Template We manually craft generation templates for each logic type based on the logical form.
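All the neural baselines below model the objective P(w | C), which decomposes autoregressively in the standard way (this factorization is implicit in the models, not stated explicitly in the paper):

```latex
P(w \mid C) = \prod_{t=1}^{T} P(w_t \mid w_{<t},\, C),
\qquad C = (\text{caption},\ \text{headers},\ \text{content},\ \text{logical form})
```

Each model differs only in how it encodes C and parameterizes the per-token distribution.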
Seq2seq+att We employ the seq2seq model with attention from (Bahdanau et al., 2015). The input sequence is formulated as the concatenation of the table caption, the table headers, the linearized table content, and the linearized logical form.
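The linearization described above can be sketched as follows; the separator tokens are hypothetical placeholders, not the paper's exact markers.

```python
def linearize_input(caption, headers, rows, logical_form):
    """Flatten the generation context into one token sequence: caption,
    headers, table content, then the linearized logical form. The special
    tokens <caption>/<header>/<content>/<logic> are illustrative."""
    parts = ["<caption>", caption, "<header>", " | ".join(headers), "<content>"]
    for row in rows:
        parts.append(" | ".join(str(row[h]) for h in headers))
    parts += ["<logic>", logical_form]
    return " ".join(parts).split()
```

The same flattened sequence (minus the graph structure) is what the pointer-generator and Transformer baselines consume; the graph2seq baseline instead keeps the logical form as a graph.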
Pointer-generator (See et al., 2017) adds a copy mechanism on top of the seq2seq model with attention, allowing the decoder to copy tokens directly from the input. Such a mechanism is known to be critical for fidelity-preserving generation involving many entities, numbers, etc.
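For reference, the copy mechanism of See et al. (2017) mixes a vocabulary distribution with a copy distribution over the attention weights via a soft switch p_gen:

```latex
P(w) \;=\; p_{\mathrm{gen}}\, P_{\mathrm{vocab}}(w)
\;+\; \bigl(1 - p_{\mathrm{gen}}\bigr) \sum_{i:\, x_i = w} a_i^{t}
```

where a_i^t is the attention weight on input token x_i at decoding step t, so rare entities and numbers appearing in the table or logical form can be emitted even if absent from the output vocabulary.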
Graph2seq+copy There is a line of research on graph neural network based encoders, e.g., (Marcheggiani and Perez-Beltrachini, 2018; Xu et al., 2018). We employ one representative model, Graph2seq (Xu et al., 2018), to encode the logical forms. The table caption and headers are first fed into a seq2seq encoder, followed by the graph encoder for the logical form. We also add the copy mechanism to allow copying from the input.
Transformer+copy The popular Transformer model (Vaswani et al., 2017) has shown remarkable progress on many tasks including NLG. On top of the original Transformer structure, we add a copy mechanism where the last hidden layer is used to calculate the attention scores and the copy switch. We also add segment embeddings for the different input components, similar to (Devlin et al., 2019).
GPT-2 Building on Transformer-based structures, recent large-scale pre-trained models have achieved new SOTA results on a wide range of NLP tasks. A typical workflow is to use the pre-trained model as initialization, then fine-tune it on task-specific data. In this work, we employ the generative pre-training model GPT-2 (Radford et al., 2019) as one of our baselines.
For all neural models we use Byte-Pair Encoding (BPE) (Sennrich et al., 2016) and the subword vocabulary used in (Radford et al., 2019). Refer to Appendix C for more implementation details.
For models without pre-training, the copy mechanism brings a significant improvement, as seen by comparing the pointer-generator with seq2seq. This is because the descriptions in our dataset involve much factual information from the table and the logical form, e.g., entity names and numbers. The pre-trained language model GPT-2, however, can produce these factual terms accurately in most cases even without a copy mechanism, demonstrating the powerful prior knowledge obtained from large-scale pre-training. Compared to the pointer-generator, which takes the linearized logical form as input, Graph2seq+copy directly models the graph structure and obtains a slight improvement. The Transformer+copy model performs better than Graph2seq+copy, as the Transformer architecture is itself a graph neural network with self-attention as the aggregation function over neighbors, treating the input as a fully-connected graph. Recent works (Lin et al., 2019; Rogers et al., 2020; Mager et al., 2020) have shown that Transformer-based structures can capture hierarchical syntactic structures and graph representations. The GPT-2 model obtains the best performance of all, with a significantly larger improvement. As a pre-trained language model with the Transformer structure, it combines the strengths of structural modeling and a language-modeling prior. Some example generations are provided in Appendix E.

Human Evaluation
Automatic scores (computed with the standard NIST mteval-v13a.pl and rouge-1.5.5 scripts) are not sufficient for a precise evaluation of factual and logical correctness. Therefore we conduct human evaluations through (1) crowd-sourcing on Amazon Mechanical Turk (AMT) and (2) human expert evaluations.
For human evaluations on AMT, we randomly sample 500 examples from each of the two best-performing methods (GPT-2 and Transformer+copy) and from the gold references. The evaluations are conducted on two axes: factual correctness and language fluency. For factual correctness, we ask the workers to verify whether the description is factually supported by the table; for language fluency, we conduct pairwise comparisons between the different methods. For both evaluations, we distribute each task to 3 workers to eliminate human variance. The evaluation results for language fluency and factual correctness are shown in Table 4 and the first row of Table 3, respectively. To precisely evaluate semantic correctness, i.e., whether the generation correctly matches the meaning of the logical form, we invite human experts (two computer science graduate students) to perform the evaluation. We sample 200 examples from each method and ask them to verify whether the description correctly presents the meaning of the logical form. Each example is examined by both students, and the decision is made after discussion. The second row of Table 3 shows the evaluation results.
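Aggregating the 3 workers' judgments per example can be sketched as a simple majority vote; the exact aggregation rule used in the paper is not specified, so this is an assumption.

```python
from collections import Counter

def aggregate_judgments(labels):
    """Majority vote over the 3 workers' judgments for one example, a simple
    way to smooth over individual human variance."""
    assert len(labels) == 3, "each example is judged by exactly 3 workers"
    return Counter(labels).most_common(1)[0][0]
```

With three workers and binary labels (e.g., factually correct vs. incorrect), a majority always exists, so no tie-breaking rule is needed.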
As we can observe from all the evaluation results, the GPT-2 model yields large improvements in both fidelity preservation and language fluency, but a gap remains, especially in semantic correctness. We believe our dataset can serve as a valuable resource posing such a challenge for high-fidelity generation with complex semantics.

Importance of the Logical Form
We conduct experiments without using the logical form, i.e., generating arbitrary logically correct descriptions based solely on the table, which is the task setting of (Chen et al., 2020). Following their setting, each generation is evaluated against all descriptions of the same table as multiple references. The best-performing model of (Chen et al., 2020) obtains a BLEU-4 score of 20.17 and a factual correctness rate of 20.2% based on human evaluation of 500 samples. In contrast, the generations of our best-performing baseline obtain a factual correctness rate of 82.4%, as shown in Table 3, which demonstrates the great importance of the logical form for high-fidelity generation. Note that the automatic scores are not directly comparable since, in our task setting, each generation maps to a unique logical form and is evaluated against a single reference.

We also perform ablation studies on the other input components: the table caption, headers, and content, using the best-performing GPT-2 model. As shown in Table 5, both the table caption and the headers provide strong context information for generation, and the table content brings a further slight improvement.

Few-Shot Setting
Considering that acquiring a large number of (logical form, description) pairs is expensive in real-world cases, we also include a few-shot learning task for our dataset, where the model is provided with only hundreds of paired examples. Previous work has shown that pre-trained language models achieve strong NLG performance even with a handful of fine-tuning instances (Chen et al., 2019b). Therefore we again use the best-performing GPT-2 model for this study. In our dataset, the number of unseen logical form structures increases as the number of training instances shrinks. As shown in Table 6, while there is still a gap to the fully-supervised result, GPT-2 trained on 1,000 instances is comparable to some of the other baselines trained on the full data. This demonstrates the potential of incorporating generative pre-training for the few-shot learning task.

Conclusion
In this work, we formulate the problem of logical-level NLG as generation from logical forms in order to obtain controllable and high-fidelity generations. To this end, we propose a new dataset named LOGIC2TEXT. There are several potential future directions. 1) Human evaluations are precise but expensive. Our dataset can be used in the reverse direction to train a semantic parser to assist parsing-based evaluations. 2) In this work, we primarily focus on generating descriptions based on the logical form. Another potential future direction is content selection, i.e., how to select and organize the logical forms to construct a discourse plan based on user interests.

A. Annotation Questions (excerpt)

Comparative (continued):
(5). Is the compared record itself mentioned in the statement?
(6). What are the other column(s) of these two rows mentioned in the statement?

Majority:
(1). What is the scope of this majority?
(2). Which column is the statement describing?
(3). Is the statement describing all of the records or the most frequent records within the scope?
(4). Select the criterion based on which we filter records to describe the majority. We consider the following criteria: "equal", "not equal", "less than", "less than or equal to", "greater than", "greater than or equal to", "fuzzily match" (or "other" if none of the above is correct).
(5). Based on the selected criterion, write the value to be filtered on for describing the majority.

Unique (continued):
(4). Select the criterion based on which we filter records in this column to find the unique row. We consider the following criteria: "equal", "not equal", "less than", "greater than", "fuzzily match" (or "other" if none of the above is correct).
(5). Based on the selected criterion, write the value to be filtered on for the unique row.
(6). On this unique row, what are the other column(s) mentioned (except the column describing the scope)? If no other column is mentioned, write 'n/a'.

B. Function Definitions
Here we list the function definitions and descriptions for our logical forms.

C. Implementation Details

For all neural models we use Byte-Pair Encoding (BPE) (Sennrich et al., 2016) and the subword vocabulary used in (Radford et al., 2019). We use the pre-trained word embeddings from (Radford et al., 2019) and project them to a smaller dimension (300) for the word embeddings. The batch size for all models is set to 32. The beam size is set to 3. As the table content only serves as context for generation, to save GPU memory we set the maximum length of the table content to 200. The hyperparameters are chosen by manual tuning with respect to the BLEU score on the validation set.
Seq2seq+att & pointer-generator The learning rate is set to 0.001. For seq2seq, training takes around 16,000 gradient steps; for the pointer-generator, around 5,000 steps.

Graph2seq+copy We reuse the code skeleton from the released code of (Xu et al., 2018). The table caption and headers are first fed into a seq2seq encoder, and the final hidden state is used to initialize the nodes of the graph encoder. When applying attention and copying, for graph nodes we concatenate the token embedding and the embedding of its node as the token's embedding. The learning rate is set to 0.0005. Training takes around 11,000 steps.

Transformer+copy We mostly follow the structure settings of the original Transformer model (Vaswani et al., 2017). We use 4 attention heads and 6 layers. The final hidden layer is used for calculating the attention scores and the copy switch. We also add segment embeddings for the different input components, similar to (Devlin et al., 2019). The learning rate is set to 0.0005. Training takes around 32,000 steps.

GPT-2 We use the GPT-2 small 117M model from the released code and pre-trained model of (Radford et al., 2019). Word embeddings are fixed during training. The learning rate is set to 0.0003. Training takes around 500 steps to converge.
All experiments are run on a GeForce GTX 1080Ti GPU. Table 8 shows the validation performance of the different baselines.

D. Human Evaluation Details
Human Evaluations on AMT We randomly sample 500 examples from each of the two best-performing methods (GPT-2 and Transformer+copy) and from the gold references. The evaluations are conducted on two axes: factual correctness and language fluency. For factual correctness, we provide the workers with both the table and the description, and ask them to verify whether the description is factually correct given the table. If the description contains too many grammar errors to be readable, the worker is instructed to select "incorrect". Minor grammar errors are acceptable as long as the worker can understand the meaning. For language fluency, we conduct pairwise comparisons between the three methods. For this evaluation we present only a pair of descriptions to the worker and ask them to select the better one based solely on language fluency (a better description should be fluent, coherent, and free of grammar errors), or to select "Tied" if the two descriptions are of similar quality. For both evaluations we distribute each task to 3 workers to eliminate human variance.

Human Expert Evaluation To conduct a precise evaluation of semantic correctness, i.e., whether the generation correctly matches the meaning of the logical form, we invite human experts (two computer science graduate students) to perform the evaluation. We sample 200 examples from each method and ask them to verify whether the description correctly presents the meaning of the logical form, with neither insufficient nor redundant information. The description should also be fluent and free of grammar errors. Therefore this evaluation can be seen as a comprehensive evaluation of generation quality. Each example is examined by both students, and the decision is made after discussion.

E. Generation Examples
We provide 2 examples of generations in Figure 8 and Figure 9.