Logical Natural Language Generation from Open-Domain Tables

Neural natural language generation (NLG) models have recently shown remarkable progress in fluency and coherence. However, existing studies on neural NLG are primarily focused on surface-level realizations with limited emphasis on logical inference, an important aspect of human thinking and language. In this paper, we propose a new NLG task where a model is tasked with generating natural language statements that can be logically entailed by the facts in an open-domain semi-structured table. To facilitate the study of the proposed logical NLG problem, we use the existing TabFact dataset~(CITATION) featuring a wide range of logical/symbolic inferences as our testbed, and propose new automatic metrics to evaluate the fidelity of generation models w.r.t. logical inference. The new task poses challenges to existing monotonic generation frameworks due to the mismatch between sequence order and logical order. In our experiments, we comprehensively survey different generation architectures (LSTM, Transformer, Pre-Trained LM) trained with different algorithms (RL, Adversarial Training, Coarse-to-Fine) on the dataset and make the following observations: 1) Pre-Trained LMs can significantly boost both the fluency and logical fidelity metrics, 2) RL and Adversarial Training trade fluency for fidelity, 3) Coarse-to-Fine generation can help partially alleviate the fidelity issue while maintaining high language fluency. The code and data are available at https://github.com/wenhuchen/LogicNLG.


Introduction
Neural network models, especially the recent wave of massive models like BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019), have shown the ability to generate natural language text at an astonishing level of fluency and coherence. For the generated text to fulfill its purpose, however, a critical property that is necessary but often overlooked is fidelity, i.e., what is generated should be faithful to the underlying data, knowledge, or meaning representation. A line of recent work has started to address the surface-level fidelity issue of natural language generation (NLG) by encouraging the model to learn to reuse the verbatim of certain inputs through copy mechanisms (See et al., 2017; Gu et al., 2016; Wiseman et al., 2017; Liu et al., 2018), structured attention (Liu et al., 2018), or planning and selection/entity modeling (Puduppully et al., 2019a,b). While shown to be effective, most such methods so far are primarily focused on surface-level realization and simply restate the facts in the underlying data (Figure 1). However, humans have the ability to generalize beyond superficial facts (e.g., "Canada has got 3 gold medals.") by inferring and communicating new statements that can be entailed from these facts (e.g., "Canada obtained the most gold medals."). We believe it is important for NLG models to be able to generalize beyond the superficial facts given to them as well.

Therefore, we propose a new task, logical NLG, where a model is tasked with generating natural language statements that can be logically entailed by the given data (i.e., the premises). The new task requires a model to jointly reason and generate sentences that are consistent both linguistically and logically. Since there are a variety of reasoning/inference tasks such as natural language inference (Bowman et al., 2015) and commonsense reasoning (Talmor et al., 2019), to avoid confusion, this paper is specifically focused on inferences involving symbolic operations over the given table (Pasupat and Liang, 2015).

Figure 2: When making the decision at the third step, the model needs to foresee the future tokens to ensure logical consistency. There is no back-tracking once the model makes a wrong decision like "5".
To empower research in this direction, we collect a new corpus LOGICNLG based on the existing TabFact (Chen et al., 2019), which brings two major renovations to the existing NLG paradigm: 1) the text involves diversified types of logical inferences including math operations like max/min/sum/add, comparison operations like same/different, and counting operations like total/only. A more detailed description of the logical inferences is listed in the Appendix. 2) while existing datasets are often restricted to a specific domain such as weather (Liang et al., 2009), restaurant (Dušek et al., 2019), NBA (Wiseman et al., 2017), etc., LOGICNLG uses open-domain tables without prior knowledge about their schema. As such, existing methods based on surface-level copying (See et al., 2017; Gu et al., 2016; Puduppully et al., 2019a) become insufficient, as does the existing fidelity evaluation based on surface-level information extraction (Wiseman et al., 2017; Rohrbach et al., 2018; Dhingra et al., 2019), which extracts surface triples in a certain pre-defined form (i.e., subj-pred-obj, n-grams) and compares them with the surface content given in the knowledge.
Most neural generation models follow a monotonic generation schema from left to right, with the current prediction depending only on the preceding words. Logical NLG poses unique challenges to this traditional generation scheme due to the mismatch between sequence order and logical order. As illustrated in Figure 2, the word "2" is derived from the logical inference 'diff(Silver medal of Colombia, Silver medal of Canada) → 2'. In other words, the logical order of the word "2" is after "more", "silver", and "Canada", while its sequence order is before those words. Since the monotonic generation scheme is purely based on sequence order and agnostic to logical order, existing NLG models struggle to maintain fidelity because they cannot model the logical dependency on future tokens. To alleviate such an order mismatch, an NLG model must have the capability to plan ahead for the next few steps before generation. In this context, we believe LOGICNLG to be an important testbed for studying such a planning/inference ability in generation models (Ford et al., 2018; Welleck et al., 2019). In this paper, we further propose a non-monotonic coarse-to-fine generation model and show that it is able to alleviate the order mismatch problem and achieve better performance. The contribution of this work is three-fold: i) We propose a new research problem of logical natural language generation, and provide novel metrics to approximately evaluate the logical fidelity of generation models.
ii) We justify the mismatch problem between sequence order and logical order of the traditional monotonic generation scheme in logical NLG.
iii) We conduct comprehensive experiments with state-of-the-art neural generation models under both automatic and human evaluation, which demonstrates the challenges and opportunities for future research on logical NLG.

Dataset and Problem Definition
Existing NLG datasets (Chen and Mooney, 2008; Dušek et al., 2019; Lebret et al., 2016; Liang et al., 2009) are mainly composed of surface-level descriptions over the given records. Though ROTOWIRE (Wiseman et al., 2017) involves sporadic inference in its long documents, the inference is restricted to domain-specific knowledge (e.g., double-double, smash, triple-double, and other NBA-related terms). Hence, we need a better testbed for studying the proposed problem.

Statistics
We construct our dataset based on TabFact (Chen et al., 2019), a table-based fact-checking dataset with rich logical inferences in the annotated statements. Specifically, we took their positive statements (the sentences which are entailed by the knowledge in the table) to form our generation dataset, which has the following merits: i) It involves very rich logical inference; every annotated sentence involves certain types of inference with minimal domain-specific knowledge. The open-domain characteristic simulates a realistic setting where we cannot enumerate the possible inferences based on the schema, which poses great challenges to the model's generalization capability.
ii) It is mainly composed of short sentences with an average length of 11 words and a simple syntactic structure, which isolates the task from other linguistic complexity to focus on the problem of logical inference.
The dataset contains tables with open schema crawled from diversified domains (Figure 4). The major categories are sports, politics, and entertainment. The schema diversity of the tables makes rule-based systems infeasible to apply. Besides, most of the tables have very rich numerical records, which provides a great testbed for logical inference.
Problem Definition Here, we formally define our proposed table-to-text generation task. The input is a table T with its title denoted as a natural language sequence W. The table T = {T_{i,j} | i ≤ R_T, j ≤ C_T} has R_T rows and C_T columns, with T_{i,j} being the content in the (i, j)-th cell. T_{i,j} could be a word, a number, a phrase, or even a natural language sentence. The annotated statement is a sentence Y = y_1, ..., y_n; we aim to train a generation model p(Y|T) to produce statements Ŷ that are both fluent and logically entailed by the given table T.
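To make the setup concrete, a single example can be sketched as a table plus an entailed statement; the field names below are illustrative assumptions for exposition, not the dataset's actual schema.

```python
# Illustrative sketch of one LOGICNLG example; field names are assumptions.
table = {
    "title": "Tournament Medal Table",                  # title W
    "columns": ["nation", "gold", "silver", "sports"],  # column headers
    "rows": [                                           # T[i][j]: (i, j)-th cell
        ["Canada", 1, 1, "Ice Hockey"],
        ["Mexico", 2, 3, "Baseball"],
    ],
}
# Y: a statement that must be logically *entailed* by T (here via a diff
# operation over the silver-medal column), not merely copied from it.
statement = "Canada has 2 fewer silver medals than Mexico."
```

Note how verifying the statement requires executing an operation over the table rather than matching a single cell.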

Automatic Evaluation
In this section, we discuss the evaluation of our proposed NLG task. The fluency evaluation is simply based on standard metrics like perplexity (Bengio et al., 2003) and BLEU-1,2,3 (Papineni et al., 2002) computed with NLTK (Bird, 2006). The most challenging problem is to evaluate the logical fidelity of the generated sentences, which is also the core problem of our paper. The existing IE-based extractive evaluation (Wiseman et al., 2017) leads to two issues, as shown in Figure 3: 1) Empty Extraction: the sentence cannot be formulated in a (subject, predicate, object) structure, so the IE system fails to extract triples for verification. 2) False Negative: the sentence is a logical composition (instead of a surface form) of the facts in the table, so the IE system cannot match it against the table. For these reasons, we test two approximate automatic metrics:

Parsing-based Evaluation
We first propose a model-based evaluation method, which aims to directly extract the meaning representation from the generated sentence and execute it against the table to verify its correctness. Our evaluation is based on weakly supervised semantic parsing (Liang et al., 2009, 2013). The basic idea is to first link entities and predicates in the sentence, then use the linked entities to perform a breadth-first search to synthesize potential logical forms, and finally use a scorer to re-rank these logical forms and filter out spurious ones. The logical form returns a binary value of True to indicate whether its logic is supported by the knowledge. The basic idea is shown in the upper part of Figure 5; the implementation details are in the Appendix. We pre-train the semantic parser f_γ on the training set (T, Y) ∈ D_train with a weakly supervised algorithm; at test time, we use it to parse a sentence Y into a set of logical forms, which is re-ranked to obtain the highest-scoring logical form P_best. We compute the ratio of P_best returning "True" on D_test to approximate the model's fidelity:

SP-Acc = E_{(T, Ŷ) ∈ D_test} [ I(P_best(Ŷ) executes to True against T) ]

where I is the indicator function.
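The ratio computation can be sketched as follows; `parse_and_rank` and `execute` are hypothetical stand-ins for the entity-linking/BFS/re-ranking parser and the logical-form executor described above, not the released implementation.

```python
def sp_acc(examples, parse_and_rank, execute):
    """Fraction of generated statements whose highest-ranked logical form
    executes to True against its table (an SP-Acc-style ratio)."""
    hits = 0
    for table, sentence in examples:
        p_best = parse_and_rank(table, sentence)  # best logical form, or None
        hits += int(p_best is not None and execute(p_best, table))
    return hits / len(examples)
```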

NLI-based Evaluation
We then propose another model-based evaluation method to complement the parsing-based evaluation (which is sensitive to semantic variation). The basic idea follows Kryściński et al. (2019) in evaluating the entailment score between the table and the generated sentence. The NLI model is based on Table-BERT (Chen et al., 2019), which linearizes the table into textual form and uses it as the evidence for natural language inference. The model is trained on the TabFact (Chen et al., 2019) dataset containing both positive and negative samples. During the evaluation, we use this NLI model to predict the entailment relationship based on the likelihood of p(entail|T, Ŷ). Finally, we compute the ratio of "entailed" sentences to approximate the model's fidelity:

NLI-Acc = E_{(T, Ŷ) ∈ D_test} [ I(p(entail|T, Ŷ) > 0.5) ]

where I is the indicator function.
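A minimal sketch of this ratio, where `entail_prob(table, sentence)` is an assumed interface returning the entailment probability as a Table-BERT-style model would:

```python
def nli_acc(examples, entail_prob, threshold=0.5):
    """Fraction of generated sentences the table-entailment model labels
    'entailed'; entail_prob is a hypothetical wrapper around the NLI model."""
    entailed = sum(int(entail_prob(t, y) > threshold) for t, y in examples)
    return entailed / len(examples)
```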
Adversarial Evaluation Adversarial evaluation (Goodfellow et al., 2014; Kannan and Vinyals, 2017) is used to study the generation model's robustness in logical reasoning. Specifically, we hire human workers from Amazon Mechanical Turk to annotate adversarial examples for the test/validation set by changing a minimum number of words to revert the logic of the sentence. Such adversarial examples preserve linguistic components like length and style, except for the logic-related words, to specifically disentangle the generation model's reasoning skill. As drawn in the lower part of Figure 5, the word "more" in the original sentence is modified into "less" to create an adversarial example.
There are two principles the workers need to follow for their jobs to be accepted: 1) the modified words/phrases should be roughly equally frequent, to balance the language prior; for example, the number "1" is better swapped with "2" or "3" rather than "9999", which rarely appears in the corpus; 2) the perturbations should be diverse enough to cover different aspects of logical reasoning skills. We use the generation model p(Y|T; β) to score the original sentence Y and the adversarial sentence Y_adv. If the confidence of the original example is higher than that of its adversarial counterpart, we count it as a successful defense, otherwise as a failed defense. We use the success rate to approximate the model's logical reasoning capability:

Adv-Acc = E_{(T, Y, Y_adv) ∈ D_test} [ I(p(Y|T; β) > p(Y_adv|T; β)) ]

where I is the indicator function.
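The success-rate computation can be sketched as a likelihood comparison; `log_prob(table, sentence)` is an assumed scoring interface over the generator p(Y|T; β).

```python
def adv_acc(triples, log_prob):
    """Defense success rate: the generator should assign higher likelihood
    to the original sentence than to its logic-flipped adversarial twin."""
    defended = sum(
        int(log_prob(t, y) > log_prob(t, y_adv)) for t, y, y_adv in triples
    )
    return defended / len(triples)
```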
Discussion Both types of metrics have pros and cons. SP-Acc and NLI-Acc are unbiased in the sense that they measure the peak samples in the model's likelihood; however, both are based on imperfect models, so their evaluation scores can be inaccurate. SP-Acc is more sensitive to number/calculation errors, while NLI-Acc is more sensitive to semantic errors; therefore, we report both to increase the metrics' robustness. In contrast, the adversarial evaluation score accurately reflects the model's reasoning capability on the given samples. However, as the provided samples might not lie in the high-confidence area of the model's distribution, it is biased as a reflection of the model's general reasoning capability. Though these fidelity metric models are prone to errors, in Section 6 we show their consistency with human judgment, which reveals their potential to assist human evaluation.

Baselines
In this section, we design comprehensive baseline models for logical NLG. Specifically, we consider two cases: non-pretrained models (LSTM/Transformer) with a copy mechanism and pre-trained models (GPT-2 and BERT) with sub-word units. We train these models with three different algorithms: maximum likelihood, adversarial training, and reinforcement learning.

Non-pretrained Models
Here we mainly consider two table encoding methods, namely field-infusing and field-gating. These two methods differ in their strategies for coalescing the field information into cells. After the table is represented as a sequence of vectors, a decoder based on an LSTM (Hochreiter and Schmidhuber, 1997) or a Transformer (Vaswani et al., 2017) is applied to generate text token by token. The two methods are depicted in the upper part of Figure 6.

Field-Infusing This strategy is inspired by Lebret et al. (2016). We first use an LSTM (Hochreiter and Schmidhuber, 1997) to encode the table field text word by word and use its last output z_i as the field representation. This representation is concatenated with the embedding of the row index #j and the word embedding at each cell to obtain a position-aware cell embedding e_k for each word inside the cell. We stack transformer layers on top of the cell embeddings to obtain the table representation h_i ∈ R^D, with D as the dimension.
Field-Gating This strategy is inspired by Liu et al. (2018). Like the previous strategy, we first use an LSTM (Hochreiter and Schmidhuber, 1997) to obtain the field representation z_i. The field representation is concatenated with the ending distance information as input to an additional field gate built inside the LSTM, as suggested in Liu et al. (2018); such a field gate controls whether the current cell has already been encoded. This mechanism helps the LSTM identify the boundaries between different cells to grasp local information.

Pre-trained Models
To further enhance fluency and resolve the out-of-vocabulary problem, we use pre-trained language models and finetune them on LOGICNLG. Specifically, we consider two models based on GPT-2 (Radford et al., 2019) and BERT (Devlin et al., 2019), respectively, and name them GPT-TabGen and BERT-TabGen.
GPT-TabGen We directly feed the paragraph P_T as the input to the pre-trained GPT-2 model and generate the output sentence Y. We finetune the model on LOGICNLG by maximizing the likelihood p(Y|P_T; β), with β denoting the parameters of the GPT-2 model (Radford et al., 2019).
BERT-TabGen 1) We encode the linearized paragraph P_T using the pre-trained BERT model into the source representation h_1, ..., h_{|T|}. 2) At the i-th time step, we replace all the words in the ground-truth statement Y after the i-th position with the <MASK> token and use BERT to encode the partially masked statement as g^i_1, ..., g^i_n. 3) We use an attention layer f_θ to obtain the output hidden states ĝ^i_1, ..., ĝ^i_n, where ĝ^i_i is used to predict the word ŷ_i. We jointly optimize β of BERT and θ to maximize

Figure 6: The non-pretrained and pre-trained generation models; the detailed table is shown in Figure 1.
the likelihood of generating text Y conditioned on the table and the masked partial sentence. As BERT is a bidirectional model, we need to re-encode the target sentence at each step to obtain g^i_{1:n}. Therefore, the generation finishes in n passes.
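The re-encode-and-fill decoding loop can be sketched as follows; `encode` and `predict_word` are hypothetical wrappers around BERT and the attention layer f_θ, not the actual implementation.

```python
def bert_tabgen_decode(encode, predict_word, n, mask="[MASK]"):
    """Sketch of the n-pass decoding: at step i the sentence is fully masked
    after position i, re-encoded, and the i-th word is predicted."""
    tokens = [mask] * n
    for i in range(n):
        hidden = encode(tokens)            # re-encode the partially masked Y
        tokens[i] = predict_word(hidden, i)
    return tokens
```

Note that each step re-encodes the whole sequence, which is why BERT-TabGen is roughly n times slower to decode than a left-to-right model.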

Training
Except for standard maximum likelihood training, we also use the following training algorithms:

Adversarial Regularization To encourage the model to ground on the table rather than relying on artificial language priors (Ramakrishnan et al., 2018), we use an adversarial regularization to enhance maximum likelihood training. Specifically, we first perform entity resolution to locate all the numbers, counts, and entities in the sentence, and then randomly replace them with entities or numbers appearing in the table T. These perturbed samples Y_adv are used as adversarial examples to regularize the model's behavior. Formally, we optimize β to maximize the objective:

log p(Y|T; β) − λ log p(Y_adv|T; β)

where λ is the controlling hyper-parameter.
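As a sketch of this objective (the exact weighting scheme is an assumption of this sketch), the per-example loss to minimize could look like:

```python
def adv_reg_loss(log_p, table, y, y_adv, lam=0.1):
    """Negative of the adversarial-regularized objective: maximize
    log p(Y|T) while penalizing likelihood assigned to the
    entity/number-perturbed Y_adv, weighted by lambda."""
    return -(log_p(table, y) - lam * log_p(table, y_adv))
```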

Reinforcement Learning
Maximum likelihood training is a fluency-driven objective, which is inconsistent with the goal of logical consistency. To bridge the gap, we view the generation problem from a reinforcement learning perspective to optimize long-term fidelity. We use the trained semantic parser to assign rewards to the policy p(y_i|y_{1:i−1}; β). At the i-th step, the generator samples different actions y_i and rolls out from the (i+1)-th step to produce a full sequence starting from y_i using greedy search. The full sentence receives a binary score r(Y, T) from the semantic parser as the reward. Formally, we optimize the objective:

E_{y_i ∼ p(y_i|y_{1:i−1}; β)} [ r(Ŷ(y_i), T) log p(y_i|y_{1:i−1}; β) ]

where Ŷ(y_i) denotes the full sequence completed from y_i by greedy roll-out, and we only use one trajectory to approximate the inner roll-out expectation for efficiency.
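A minimal sketch of the resulting policy-gradient surrogate loss, assuming the token-level log probabilities and their roll-out rewards have already been collected (no baseline term, mirroring the one-trajectory approximation above):

```python
def reinforce_loss(log_probs, rewards):
    """Single-trajectory REINFORCE surrogate: each sampled token's log
    probability is weighted by the binary parser reward r(Y, T) that its
    greedy roll-out received."""
    assert len(log_probs) == len(rewards)
    return -sum(r * lp for lp, r in zip(log_probs, rewards))
```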

Coarse-to-Fine Generation
As discussed before, the baseline models follow the monotonic generation scheme and suffer from the mismatch between sequence order and logical order (Figure 2).In this section, we propose an imperfect remedy for such a situation based on the coarse-to-fine generation paradigm.
Before plunging into technical details, it is helpful to first realize the resemblance between logical NLG and semantic parsing (Dong and Lapata, 2018). Compared to traditional NLG tasks like machine translation and summarization, logical NLG is closer to semantic parsing in the sense that a model may make catastrophic errors that are impossible to correct at later steps (Figure 2). Therefore, we take inspiration from semantic parsing models (Dong and Lapata, 2018) that have proven effective in mitigating such errors and propose a coarse-to-fine generation scheme. We break generation down into two phases. In the first phase, the model generates a template in which the entities and numbers of Y are replaced by a placeholder token "[ENT]"; in the second phase, the model realizes the surface form by filling in the placeholders conditioned on both the table and the generated template (Figure 7). Unlike template-based or delexicalized generation (Reiter and Dale, 1997; Wen et al., 2015), which uses rigid slot filling prone to grammatical errors, our fine-grained generation has the flexibility to modify the surface form of non-slot words, which alleviates the linguistic coherence problem (Sharma et al., 2017).
By decoupling sentence structure generation and entity grounding, our proposed coarse-to-fine scheme can partially alleviate the mismatch problem. For example, the generation of "Canada" is now aware of "more than" in the latter part of the sentence, which exposes the model to more context than standard monotonic models and helps it make logically consistent decisions, though the dependency on "1" and "Mexico" is still not captured. The proposed two-step generation can be viewed as a first step towards a fully non-monotonic generation model that solves such a mismatch problem.
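The two-phase procedure can be sketched as below; `generate_template` and `fill_slots` are hypothetical wrappers around the two GPT-2 passes, not the released code.

```python
def coarse_to_fine(table, generate_template, fill_slots, ent="[ENT]"):
    """Two-phase sketch: phase one emits a template whose entities/numbers
    are the placeholder [ENT]; phase two grounds each slot conditioned on
    the table *and* the whole template, so every surface decision can see
    later context such as "more ... than"."""
    template = generate_template(table)
    n_slots = template.count(ent)
    values = fill_slots(table, template, n_slots)  # one grounded value per slot
    for v in values:
        template = template.replace(ent, str(v), 1)
    return template
```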

Experiments
In this section, we explain the experimental details and then comprehensively report the automatic evaluation of different generation models and training algorithms. Finally, we conduct a detailed human evaluation and error analysis.

Experiment Setup
For the non-pretrained models, we fix the hidden size of both the LSTM and the Transformer to 256; the Transformer has 3 layers with 4 heads, and the LSTM is also 3-layered. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 2e-4 to jointly optimize the parameters and keep the model with the best perplexity on the validation set. At test time, we use greedy search to generate text and calculate the BLEU-1,2,3 scores against the 5 references for each table. For the pre-trained models, we base our implementation on Huggingface's Transformers (Wolf et al., 2019) for both BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019) with a sub-word vocabulary of 30K. During linearization, we found that using the whole table greatly compromises performance, partly due to 1) the over-length issue with pre-trained LMs and 2) too much irrelevant information in the input. Therefore, we use a partial table as input: specifically, we run entity linking over the sentences to detect the linked columns of the table and only linearize the partial table as the input P_T.
Both are finetuned using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-6. In both the adversarial training and reinforcement learning algorithms, we add the maximum likelihood objective to stabilize training and select the balancing factor based on the validation Adv-Acc score. For coarse-to-fine training, we first warm up the model to generate the template sequence and then finetune it on the concatenated full sequence. Model selection is based on the BLEU-3 score on the validation split.

Experimental Results
We first perform an automatic evaluation to approximately measure the performance of different models and then conduct an in-depth human evaluation to have a better understanding.
Automatic Evaluation: The experimental results are summarized in Table 2, where we comprehensively survey different architectures and training algorithms. For the non-pretrained models, we observe that the Transformer is slightly better than the LSTM, and the two table encoding strategies achieve similar results. In contrast, pre-trained models are much better at lowering the perplexity, and their generated sentences significantly outperform the non-pretrained models in terms of both fluency and fidelity scores, with GPT-TabGen and BERT-TabGen achieving similar performance. As BERT-TabGen runs much slower due to its multiple passes of decoding, we favor GPT-TabGen in the following experiments. With adversarial regularization and reinforcement training, the model only improves the optimized fidelity metric, while the fluency scores drop significantly. Such phenomena confirm our assumption about the caveats of the monotonic generation paradigm. For the proposed coarse-to-fine generation scheme, the "[ENT]" tokens are replaced by entity names, which normally contain phrases like "Feb 2nd". Such n-gram phrase substitution preserves the completeness of entity names and thus leads to higher 2/3/4-gram matches, which translates into higher BLEU-3 and lower BLEU-1 in Table 2. The proposed coarse-to-fine generation yields reasonable improvements on NLI-Acc and Adv-Acc, which demonstrates its advantage in capturing logical dependency.
Human Evaluation To further investigate the quality of the generated text, we perform a human evaluation. Specifically, we sample 200 sentences from different models and distribute them independently to human experts (graduate students from the computer science department) to verify their quality. The quality measure is divided into four categories: 1) non-sense: the sentence does not make much sense, mainly due to disfluency or repetition problems; 2) wrong: a fluent sentence with wrong logic; 3) partial-correct: the sentence contains more than one fact, at least one of which is correct; 4) correct: high quality in both fluency and logical correctness. We demonstrate the results in Figure 8, from which we observe that pre-training significantly decreases the non-sense proportion. However, RL and Adv-Reg both harm fluency and lead to more non-sense sentences. In contrast, the coarse-to-fine model maintains the non-sense proportion while significantly increasing the correct/partial-correct sentences. From the human evaluation, even the best performing model gets only slightly over 20% of its predictions logically correct, which reflects the challenge LOGICNLG poses to the existing paradigm.

Evaluation Metrics
Here we analyze the effectiveness of the defined automatic evaluation metrics for fidelity evaluation.

Related Work
Natural Language Generation Natural language generation is a long-standing problem (Kukich, 1983; Holmes-Higgin, 1994; Reiter and Dale, 1997) that involves generating text from records or data. Recently, many neural-based generation models (Puduppully et al., 2019a,b; Lebret et al., 2016; Wiseman et al., 2018) have achieved impressive performance on existing datasets (Chen and Mooney, 2008; Liang et al., 2009; Lebret et al., 2016; Dušek et al., 2019; Wiseman et al., 2017), whose annotated text is mostly surface-level description without logical inference. Unlike them, LOGICNLG features rich inference, which poses great challenges to existing models and evaluations.
Non-monotonic Generation There have been recent attempts to study the problem of non-monotonic text generation, which aims to teach the generation model to learn the generation order without external supervision (Ford et al., 2018; Welleck et al., 2019; Gu et al., 2019; Mansimov et al., 2019). These models have been shown to learn rational generation orders and approach performance similar to the left-to-right case. Such approaches are useful for capturing more sophisticated dependencies within the sentence, which provides a plausible direction to pursue on LOGICNLG.
Factualness Evaluation Fidelity is an important research topic in generation. In ROTOWIRE (Wiseman et al., 2017) and MSCOCO (Lin et al., 2014), IE-based extractive evaluation (Rohrbach et al., 2018; Dhingra et al., 2019) is adopted for surface-level matching to replace costly human evaluation. In abstractive summarization, Goodrich et al. (2019) propose an NER + relation classification method to investigate fidelity in generated summaries, while Kryściński et al. (2019) propose to use NLI models to assess the entailment between the generated text and the given document. These evaluations go beyond the surface level to study more sophisticated linguistic phenomena like paraphrasing, compression, entailment, inclusion, etc., which are common in summarization tasks.

Conclusion
In this paper, we propose logical NLG to study the logical inference problem in generation. We conduct comprehensive experiments to show that existing NLG models are restricted by their monotonic nature, and conclude this to be a proper next-step problem for studying NLG systems. There are still unsolved problems in logical NLG, e.g., how to improve the quality of the automatic metrics to better assist humans in judging models' performance. To promote research in this direction, we host a LogicNLG challenge to help benchmark current progress.

A Dataset Examples
In order to give readers a better sense of the statements in LOGICNLG, we demonstrate some typical examples in Figure 9 and Figure 10. Each table in the dataset is associated with five different examples covering diversified inference skills. For example, Figure 9 requires the 'all' operation to identify multiple rows having the same value on certain properties. Figure 10 requires the model to perform superlative or count operations to identify the numerically highest number.

B Logical Operation Distribution
The dataset consists of the most common types of logical inference in daily communication. To help readers understand the semantics of these inferences, we list their definitions and some examples below: • superlative: operations involving max, min, or other comparison operations to get the lowest or highest value. Sentence: xxx is the tallest player in the xxx team.
• only: operation to identify the single entity which has a unique property the other entries do not have.Sentence: xxx is the only person to win all the games.
• before/after: operations to compare time or spatial order.Sentence: xxx is born before xxx.
• count: operations to count the number of entries meeting a certain criterion. Sentence: there are two people from the central united states.
• comparison: operations to compare two or more entities. Sentence: xxx has better income than xxx.
• both/neither: operations to summarize the common properties of two entries.Sentence: xxx and xxx are both from the same country.
• sum/diff: operations to perform numeric summation or difference between numbers. Sentence: xxx gives 1 more dollar than xxx in the donation.
• average: operations to compute the arithmetic mean over entries. Sentence: the average number of people attending the game is 500.
• unique: operations to summarize the number of distinct entities, similar to the DISTINCT operation in SQL. Sentence: from the table, players are from 4 unique countries.

C Semantic Parser
Specifically, the scorer is realized by a matching model f_γ, which takes a logical form P and the statement Y and outputs a consistency score f_γ(P, Y) in the range [0, 1], with higher values indicating better consistency. As no ground-truth logical forms are provided, we use weakly supervised training.
The set of synthesized logical forms is denoted as P; the logical forms returning a binary value of True are viewed as pseudo-positive examples P+ and the logical forms returning False are treated as pseudo-negative examples P−. We optimize the following objective to discriminate the two sets:

L(γ) = E_{P ∈ P+} [f_γ(P, Y)] − E_{P ∈ P−} [f_γ(P, Y)]

The resolution model first figures out which entities appear in the table and which numbers need to be inferred. These results are pushed to a buffer as the initial point; the BFS search then tries to compose plausible logical forms based on the values from the buffer. However, most of the synthesized logical forms are not relevant to the semantics the sentence aims to express. In the end, we train a ranker that learns to identify the most consistent logical form and use it to represent the formal semantics of the given sentence.
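The discriminative objective can be sketched as below; `f_gamma` is a stand-in for the matching model, and the loss is the negative of the objective so it can be minimized.

```python
def ranker_loss(f_gamma, sentence, pos_forms, neg_forms):
    """Weakly supervised ranking sketch: push scores of logical forms that
    execute to True (pseudo-positives) above those that execute to False
    (pseudo-negatives), i.e. minimize -(E_{P+}[f] - E_{P-}[f])."""
    pos = sum(f_gamma(p, sentence) for p in pos_forms) / len(pos_forms)
    neg = sum(f_gamma(p, sentence) for p in neg_forms) / len(neg_forms)
    return -(pos - neg)
```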

D Qualitative Example
Next, we demonstrate some generated samples in Figure 12, which are generated from a table crawled from a Wikipedia page. Though most of the text generated by the model is coherent and reasonable, we do observe some disfluency like repetition, contradiction, and erroneous sentences such as sentence 5. Among the other sentences, three are logically correct: the first sentence contains quite complex logic with three different symbolic operations (argmax, argmin, after), and the fourth and sixth sentences involve operations like filter and count. In contrast, the second and third examples are logically inconsistent with the table despite being fluent.

Figure 1: Table-to-text generation examples with and without implicit logical inference. Logical NLG requires a generation model to generate natural language statements that can be logically entailed by the facts in the table instead of simply restating certain superficial facts in natural language.

Figure 5: The parsing-based and adversarial evaluation to measure the model's correctness in logical reasoning.

Figure 7: Coarse-to-fine generation scheme: first generate a template, then realize the surface form. It exposes more context to the surface realization model for better capturing logical dependency.

Figure 8: The human evaluation results of different models on the sampled sentences.
As demonstrated in Figure 11, our semantic parser is comprised of three parts: a resolution model, a breadth-first-search model, and a ranker model.

Table Linearization
We follow recent work on representing a knowledge base as natural language (Liu et al., 2019; Zhang et al., 2019) and propose "table linearization", which uses a template to flatten the table T into a document P_T = w_1, ..., w_{|T|} fed into pre-trained language models to generate the statement Y, where w_i denotes the i-th word in the paragraph P_T and |T| the length of the paragraph (the word w_i is either a table entry or a functional word in the template). As depicted in the bottom-left part of Figure 6, the original table T is transformed into a paragraph by horizontally scanning each cell.
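The horizontal-scan templating can be sketched as follows; the exact template wording is an assumption based on the example in Figure 6, not the released implementation.

```python
def linearize_table(title, columns, rows):
    """Template-based linearization sketch producing the paragraph P_T:
    each cell is rendered as 'the <column> is <value>', row by row."""
    def ordinal(n):
        return {1: "1st", 2: "2nd", 3: "3rd"}.get(n, f"{n}th")
    parts = [f'Given the table of "{title}".']
    for i, row in enumerate(rows, start=1):
        cells = ", ".join(f"the {c} is {v}" for c, v in zip(columns, row))
        parts.append(f"In the {ordinal(i)} row, {cells}.")
    return " ".join(parts)
```

In practice, only the columns detected by entity linking would be linearized, per the partial-table input described in the experiment setup.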

Table 2: The experimental results of different models on the test split of LOGICNLG, where we split the table into non-pretrained LSTM/Transformer, small pre-trained LM (sm), and medium/large pre-trained LM (med/lg).