TableGPT: Few-shot Table-to-Text Generation with Table Structure Reconstruction and Content Matching

Although neural table-to-text models have achieved remarkable progress with the help of large-scale datasets, they suffer from insufficient learning when training data is limited. Recently, pre-trained language models have shown potential in few-shot learning, drawing on linguistic knowledge learnt from pretraining on large-scale corpora. However, bringing the power of pretrained language models to table-to-text generation in the few-shot setting faces three challenges: (1) the gap between the task's structured input and the natural language input used to pretrain the language model; (2) the lack of modeling for table structure; and (3) the need to improve text fidelity by reducing incorrect expressions that contradict the table. To address these problems, we propose TableGPT for table-to-text generation. First, we utilize a table transformation module with templates to rewrite the structured table in natural language as input for GPT-2. In addition, we exploit multi-task learning with two auxiliary tasks: one preserves the table's structural information by reconstructing the structure from GPT-2's representation, and the other improves the text's fidelity with a content matching task that aligns the table with the information in the generated text. Experimenting on Humans, Songs and Books, three few-shot table-to-text datasets in different domains, our model outperforms existing systems in most few-shot settings.


Introduction
Table-to-text generation, which aims at generating descriptive text about important information in structured data, has broad application prospects for communicating with humans in a comprehensible and natural way, such as financial report (Murakami et al., 2017) and medical report (Hasan and Farri, 2019) generation. In recent years, data-driven models have shown an impressive capability to produce informative and fluent text with the help of large-scale datasets, such as WIKIBIO (Lebret et al., 2016) and E2E (Dušek et al., 2020). However, it is not always feasible to collect large-scale labeled datasets for the various domains in the real world, resulting in unsatisfying performance due to insufficient training. Such a few-shot learning setting for table-to-text generation is not well explored, and in this paper we focus on how to efficiently model few-shot table-to-text generation with limited training pairs.
Recently, pre-trained language models have shown promising progress in various natural language processing tasks (Devlin et al., 2019; Radford et al., 2019). They capture linguistic knowledge by pretraining on large-scale unlabeled datasets and generalize to downstream tasks with little labeled data in the target domain, effectively modeling the few-shot setting (Peng et al., 2020). However, efforts to benefit table-to-text generation from powerful pre-trained language models, especially in the few-shot setting, are non-trivial due to three challenges. (1) There is a gap between the structured data input for table-to-text generation and the natural language input used for pretraining. (2) It lacks modeling of the table's structure, which contains rich information for understanding the input before generating text. (3) It does not address how to maintain the text's fidelity while exploiting linguistic knowledge from the pretraining corpus; that is, the (highlighted) information in the text (Table 1) should be correctly derived from the structured data. To alleviate these problems, we propose TableGPT, which focuses on generating high-fidelity text for table-to-text generation with limited training pairs. To address the gap between the structured table input and the natural language input that GPT-2 processes during pretraining, we utilize a table transformation module that employs templates to naturally transform the structured table into natural language. In addition, we utilize two auxiliary tasks under a multi-task learning framework, table structure reconstruction and content matching, targeting pretrained GPT-2's lack of table structure modeling and the text's fidelity. In detail, the table structure reconstruction task forces GPT-2 to embed the table structure into its representation when modeling the structured table.
Besides, we utilize a content matching task that helps the model correctly describe important information from the table via the Optimal Transport technique, which measures the distance between the information in the generated text and the information in the table and uses that distance as a penalty for text with incorrect information.
We conducted experiments on three data-to-text datasets from different domains (Chen et al., 2020b): Humans, Books and Songs, in various settings. Both automatic and human evaluation results show that our model achieves new state-of-the-art performance for table-to-text generation in terms of generating fluent and high-fidelity text in most few-shot settings.

Task Definition
For the table-to-text task discussed in this paper, each training instance can be formulated as a pair of table and summary E = (S, T). Given a table, formulated as a set of records S = {r_i}_{i=1}^{N}, the model is expected to generate descriptive text T = w_1, w_2, ..., w_L, where N is the number of records and L is the number of words in the text. Each record r_i consists of two types of information: r_i.a and r_i.v. r_i.a denotes the attribute of the record (e.g. name) and r_i.v denotes the corresponding value (e.g. james beattie). Note that both r_i.a and r_i.v can be viewed as a sequence of words.
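As a minimal illustration of this formulation, a table can be held as a list of attribute-value records. The class and field names below are our own, not from the paper:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Record:
    a: List[str]  # attribute word sequence, e.g. ["name"]
    v: List[str]  # value word sequence, e.g. ["james", "beattie"]

# A toy table S with N = 2 records
S = [
    Record(a=["name"], v=["james", "beattie"]),
    Record(a=["birth", "date"], v=["27", "february", "1978"]),
]
N = len(S)  # number of records
assert S[0].a == ["name"] and N == 2
```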

Pre-trained Language Model
Recently, pre-trained language models, such as BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019), XLNet and more, have achieved remarkable progress in various NLP tasks. The main idea is to first pretrain a neural language model with a large number of parameters on a large-scale dataset in order to capture linguistic knowledge, and then transfer that knowledge to downstream tasks by finetuning on each task's dataset. Impressively, these models outperform the previous state of the art on various NLP tasks by a large margin. Since we investigate the table-to-text generation task in this paper rather than natural language understanding tasks, we choose GPT-2 as the basis of our model. GPT-2 is a 12-to-48-layer transformer decoder (Vaswani et al., 2017) with 117 million to 1542 million parameters. Each layer consists of masked multi-head self-attention and a feed-forward neural network with residual connections and layer normalization. The pre-training objective of GPT-2 is the same as a standard language model: maximizing the probability of the gold text. The large-scale model is trained on the vast and diverse WebText dataset, with 8 million documents collected from the Internet. Its success in text generation is attributed to both its high-capacity model and the knowledge learnt from pretraining on a vast dataset.

(Figure 1 caption: (a) Table Structure Reconstruction and (c) Content Matching are auxiliary tasks. The former reconstructs attributes from GPT-2's representation of the value tokens, forming L_SR. The latter uses Optimal Transport to measure the distance between the information mentioned in the text and in the table, using this as the content matching loss L_CM. The model is jointly finetuned with these three losses.)

Approach
In this section, we address the three incompatibilities between pretrained language models and table-to-text generation illustrated in Section 1. Figure 1 presents the overall multi-task training framework of our model. We first utilize the table transformation module to transform the structured table into a text sequence and aggregate it with the reference text, resulting in a suitable training sequence for the GPT-2 model. Then, two auxiliary tasks, table structure reconstruction and content matching with optimal transport, are performed on top of GPT-2's representation of the training sequence. The two auxiliary tasks' training objectives, along with GPT-2's language model objective, are jointly finetuned based on the pre-trained GPT-2 under the multi-task training framework. The overall goal is to produce high-fidelity text while maintaining its fluency.

Table Transformation
As noted in Section 2.1, the given structured table consists of multiple records as attribute-value pairs. To adapt to the sequential nature of a language model, we employ a template-based table serialization method (Chen et al., 2019b) to encode the structured table as a sequence. For example, we serialize the attribute-value pair "name: jack reynolds" as the sentence "name is jack reynolds ." and concatenate all such sentences into a document following the order of records in the table.
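A minimal sketch of this template-based serialization; the function name and exact template wording are illustrative, not the paper's implementation:

```python
def serialize_table(records):
    """Template-based serialization: each attribute-value pair becomes
    a short sentence 'attr is value .', concatenated in table order."""
    sentences = []
    for attr, value in records:
        sentences.append(f"{attr} is {value} .")
    return " ".join(sentences)

table = [("name", "jack reynolds"), ("occupation", "footballer")]
print(serialize_table(table))
# -> name is jack reynolds . occupation is footballer .
```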
After obtaining the serialized table, we connect it with the corresponding natural language description T using the special token "<table2text>". It serves as a functional token that both encodes the overall information of the table and acts as a starting signal for generating text. The whole sequence ends with the special token "<endoftext>". In this way, our model encodes the structured table and learns to predict the target sequence one word at a time as in GPT-2. We denote the final input sequence as ST = st_1, ..., st_{m+n+2}, where m is the length of the serialized table, n is the length of the text and 2 accounts for the two special tokens mentioned above. The language model's training objective is to maximize the likelihood of the reference text, which is equivalent to minimizing the negative log-likelihood (language model loss L_LM), as characterized by Equation 1:

L_LM = - Σ_{i=m+2}^{m+n+2} log p(st_i | st_{<i})    (1)
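The training-sequence assembly and the negative log-likelihood can be sketched as follows. Token handling is simplified to whitespace splitting, and the probabilities are toy values standing in for model outputs:

```python
import math

def build_training_sequence(serialized_table: str, text: str):
    """ST = serialized table + <table2text> + reference text + <endoftext>."""
    return (serialized_table.split()
            + ["<table2text>"]
            + text.split()
            + ["<endoftext>"])

st = build_training_sequence("name is jack reynolds .",
                             "jack reynolds is a footballer .")
m, n = 5, 6  # serialized-table length and text length
assert len(st) == m + n + 2

# L_LM: negative log-likelihood of the gold continuation, with toy
# next-token probabilities standing in for the model's predictions
probs = [0.9, 0.7, 0.8]
L_LM = -sum(math.log(p) for p in probs)
assert L_LM > 0
```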

Table Structure Reconstruction
As shown in Table 1, unlike many natural language generation tasks that take sentences as input, table-to-text generation models need to process tables with structural information. Each data record in the input can be seen as a pair of attribute and value. Traditionally, table-to-text models used attribute-value concatenation to represent tables; in this way, they capture the structural information by learning the correspondence between value and attribute. However, when we transform the table into natural language and use the pre-trained language model GPT-2 for representation, the model lacks explicit modeling to incorporate such structural information. Inspired by prior work, our model treats the attribute names as labels from which to reconstruct the structural information from GPT-2's learned representations of the table values.
In detail, as shown in Figure 1 (b), given a serialized table S_k consisting of different attributes and values in natural language form, the table structure reconstruction task takes the last layer of GPT-2's hidden states for each value token, [H^t_{i,j}]_{i=1:n, j=1:m_i}, and classifies which attribute each value token's representation refers to. Here i indexes the i-th record of the table S_k, j indexes the j-th value token of the i-th record, n is the number of records and m_i is the number of tokens in the i-th record's value. Equation 2 shows the reconstruction classifier:

p(a_{i,j}) = softmax(W_t H^t_{i,j} + b_t)    (2)

Note that the serialized table consists of attribute, value and template tokens; in this auxiliary task, we only take GPT-2's hidden states for the value tokens and reconstruct the structural information by classifying their corresponding attribute. W_t and b_t are the trainable parameters of the introduced classifier and p(a_{i,j}) is the probability of classifying value token H^t_{i,j} as referring to attribute a_{i,j}. We use cross entropy as this task's objective function, illustrated by Equation 3:

L_SR = - (1/Z) Σ_{i,j} log p(a_{i,j})    (3)

where a_{i,j} refers to the gold attribute label for the value token and Z is the number of value tokens that need to reconstruct the corresponding attribute label. By incorporating this auxiliary task, TableGPT is guided to embed structural information when representing the table at the training stage.
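A minimal numpy sketch of the reconstruction classifier and its cross-entropy loss; the hidden size, number of attributes and toy gold labels are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_attrs, Z = 8, 3, 5   # hidden size, #attributes, #value tokens

H = rng.normal(size=(Z, d))          # GPT-2 hidden states of value tokens
W_t = rng.normal(size=(d, num_attrs))
b_t = np.zeros(num_attrs)
gold = np.array([0, 0, 1, 2, 2])     # gold attribute id per value token

# p(a_{i,j}) = softmax(W_t h + b_t), computed row-wise over value tokens
logits = H @ W_t + b_t
logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Cross entropy averaged over the Z value tokens
L_SR = -np.log(p[np.arange(Z), gold]).mean()
assert L_SR > 0
```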

Content Matching
Taking Table 1 as an example, generating high-fidelity text that correctly describes the information in the table is the core of table-to-text generation. Producing fluent but incorrect text still means unsatisfying performance, as the text is not reliable for the purpose of disseminating comprehensible information. Ideally, when generating words intended to describe information in the table, directly copying them from the table would result in high-fidelity text. However, it is non-trivial to integrate a copy mechanism into the transformer architecture of the GPT-2 model, since changing the model structure may break the syntactic and semantic features contained in the pretrained language model, which are essential for text generation, especially in the few-shot setting. Also, rephrasing is sometimes needed to produce more natural text. To encourage our model to generate high-fidelity text while keeping GPT-2's advantage of producing fluent text, we utilize another auxiliary task, called the content matching task, during finetuning on the table-to-text corpus. The content matching task explicitly matches the important information in a table with the information in the corresponding generated text. An intuitive approach is to apply a mis-matching loss by hard-matching key information in the table against information in the generated text, but that is discrete and non-differentiable, so it cannot be learned directly via gradient descent. Inspired by optimal transport (OT), which can measure the distance between information in a source sequence and a target sequence without breaking the end-to-end training process, we adopt it as a content matching loss that guides the model to generate text containing information aligned with the table.
As in Section 3.1, the whole GPT-2 training sequence consists of the serialized table and the reference text. The serialized table sequence, x = x_1, ..., x_m, can be represented as a discrete distribution µ = Σ_{i=1}^{m} u_i δ_{x_i}, where u_i ≥ 0 and Σ_i u_i = 1, m is the length and δ_x is the Dirac function centered on x. Similarly, the discrete distribution of the reference text sequence y = y_1, ..., y_n can be represented as ν = Σ_{j=1}^{n} v_j δ_{y_j}. Under this setting, the OT (optimal transport) distance between the probability distributions u = {u_i}_{i=1}^{m} and v = {v_j}_{j=1}^{n} is defined as the solution of the following network-flow problem (Luise et al., 2018):

D_OT(µ, ν) = min_{T ∈ Π(µ, ν)} Σ_{i=1}^{m} Σ_{j=1}^{n} T_{ij} d(x_i, y_j)

where Π(µ, ν) = {T ∈ R_+^{m×n} | T 1_n = µ, T^T 1_m = ν} is the set of joint distributions with the two marginals u and v, 1_n and 1_m are the n-dimensional and m-dimensional all-one vectors respectively, and d(x_i, y_j) denotes the cost of moving x_i to y_j. In particular, we adopt the cosine distance between the two token embedding vectors of x_i and y_j, defined as d(x_i, y_j) = 1 − (x_i^T y_j) / (||x_i||_2 ||y_j||_2). Exact minimization over T is computationally intractable. To overcome this problem, we use the recently proposed Inexact Proximal point method for Optimal Transport (IPOT) as an approximation.
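A rough numpy sketch of the cosine-distance cost and an IPOT-style approximation of the OT distance. The β value and iteration counts are illustrative, and this follows the generic IPOT recipe rather than the paper's exact implementation:

```python
import numpy as np

def ipot(C, mu, nu, beta=0.5, n_iter=50):
    """IPOT sketch: approximate min over transport plans T of <T, C>.
    C: (m, n) cost matrix; mu, nu: marginal distributions."""
    m, n = C.shape
    T = np.ones((m, n)) / (m * n)   # initial transport plan
    G = np.exp(-C / beta)           # proximal kernel
    b = np.ones(n)
    for _ in range(n_iter):
        Q = G * T                   # proximal step reuses the previous plan
        a = mu / (Q @ b)            # Sinkhorn-style scaling updates
        b = nu / (Q.T @ a)
        T = a[:, None] * Q * b[None, :]
    return (T * C).sum()            # approximate OT distance

# Cosine-distance cost between toy table and text token embeddings
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(4, 8)), rng.normal(size=(5, 8))
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
C = 1.0 - Xn @ Yn.T                 # d(x_i, y_j) = 1 - cosine similarity
mu = np.full(4, 1 / 4)
nu = np.full(5, 1 / 5)
L_CM = ipot(C, mu, nu)
assert np.isfinite(L_CM) and L_CM >= 0
```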
For natural language generation tasks such as neural machine translation, the OT distance is often applied by matching the source sequence with the whole target sequence, since almost every word in both sequences is supposed to be matched. However, for the table-to-text task in a realistic setting, there is redundant information in both the table and the text. To apply the OT distance, unlike the previous adoption (Wang et al., 2020) based on the assumption that all information in the table should be described in the text, we propose to match only the record words that appear in both the table and the reference text. In this way, the OT distance avoids wrongly penalizing text that does not mention redundant information in the table.
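The restriction to shared record words can be sketched as a simple filtering step before computing the OT cost; the function name and tokens are illustrative:

```python
def shared_record_words(table_values, text_tokens):
    """Keep only record words mentioned in both the table and the text,
    so OT is not computed over redundant, unmentioned records."""
    text_set = set(text_tokens)
    return [w for w in table_values if w in text_set]

table_values = ["james", "beattie", "striker", "1978", "hometown"]
text = "james beattie is a striker".split()
print(shared_record_words(table_values, text))
# -> ['james', 'beattie', 'striker']
```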

Learning Objective
For table structure reconstruction and content matching, both auxiliary tasks are trained together with the main GPT-2 language model loss, which can be regarded as multi-task learning. The loss function of multi-task learning consists of the language model loss L_LM, the table structure reconstruction loss L_SR and the content matching loss L_CM. The loss function L_MT of the full model is computed as follows:

L_MT = L_LM + λ_1 L_SR + λ_2 L_CM    (4)

where λ_1 and λ_2 are hyper-parameters serving as scale factors. Note that when optimizing L_CM with the IPOT algorithm, the gradients of the OT loss are hard to back-propagate to the model's parameters, since the process of sampling words from the multinomial distribution produced by the language model is non-differentiable.

Implementation Details

We implement TableGPT based on the transformers library (Wolf et al., 2019). The configuration of the base GPT-2 model is 12 layers and 8 attention heads per layer. For the optimizer, we adopt the OpenAI AdamW optimizer with 100 warmup steps. We train the model with the learning rate set to 2e-4. The batch size is set to 10 for all datasets. The weights λ_1, λ_2 of the table structure reconstruction loss and content matching loss are both 0.2, chosen according to performance on the validation set. Following Chen et al. (2020b)'s way of dealing with the vocabulary limitation, for all datasets we use Byte Pair Encoding (BPE) and the subword vocabulary as in Radford et al. (2019).
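The combined multi-task objective described above can be sketched directly; the λ values of 0.2 follow the validation-tuned weights reported in the text:

```python
def multitask_loss(l_lm, l_sr, l_cm, lambda1=0.2, lambda2=0.2):
    """L_MT = L_LM + lambda1 * L_SR + lambda2 * L_CM.
    lambda1 and lambda2 are scale factors for the two auxiliary losses."""
    return l_lm + lambda1 * l_sr + lambda2 * l_cm

# Toy loss values: L_LM = 1.0, L_SR = 0.5, L_CM = 0.5
assert abs(multitask_loss(1.0, 0.5, 0.5) - 1.2) < 1e-9
```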

Comparing Methods
We compare our proposed TableGPT with a baseline model and the previous state-of-the-art model: Base and Base+switch+LM. More details can be found in Chen et al. (2020b).

Table 4: Human evaluation results. Marked models perform significantly differently from TableGPT (p < 0.05), using a one-way ANOVA with post-hoc Tukey HSD tests.
• Base: It is based on a Seq2Seq model with a field-gating encoder that incorporates the table structure's information during encoding. Additionally, it utilizes pre-trained word embeddings that are fixed during the training stage. Since it achieves competitive performance on large-scale datasets, it shows how a data-driven Seq2Seq model typically performs with limited training data.
• Base + switch + LM: It tries to exploit GPT-2's knowledge learnt from pretraining on a vast corpus by proposing a switch policy that chooses between copying from the infobox and generating from the GPT-2 language model when generating each word of the text. We also use the codes and data released by Chen et al. (2020b) to reproduce its results for human evaluation and report the corresponding automatic evaluation results as Base + switch + LM(R).
• TableGPT: In this paper, we propose TableGPT, which exploits GPT-2's knowledge learnt from pretraining on a vast corpus for few-shot learning while enhancing it to generate high-fidelity text with two auxiliary tasks. We also perform ablation studies to evaluate each auxiliary task's contribution: -sr denotes the variant without table structure reconstruction, -cm denotes the variant without content matching, and -sr&cm shows the performance of GPT-2 without either auxiliary task.

Automatic Evaluation
Following previous work (Chen et al., 2020b), we adopt BLEU-4 (Papineni et al., 2002) and ROUGE-4 (Lin, 2004) for automatic evaluation. Tables 2 and 3 show the corresponding results of the comparing methods on the different datasets. Although Base achieves competitive results when trained on a large-scale dataset, its performance drops drastically in the few-shot setting. While utilizing a switch policy to combine copying words from the table with generating words from GPT-2 (Base + switch + LM) achieves impressive performance in all few-shot settings, a standard GPT-2 model (TableGPT -sr&cm) that takes a serialized table as input and generates text without copying can actually perform better in most few-shot settings. TableGPT with table structure reconstruction and content matching, which preserves structural information during encoding and guides the model to generate high-fidelity text, further improves performance. Ablation studies also show that each auxiliary task contributes to the performance enhancement, and applying both achieves the best performance in most few-shot settings.

Human Evaluation
Following the settings in Chen et al. (2020b), we conduct human evaluation comparing TableGPT with the previous state-of-the-art model Base+switch+LM(R) and the Reference from two aspects: factual correctness and language naturalness. We sampled 100 tables along with the corresponding generated text from the Humans, Books and Songs test sets (under the few-shot setting of 100 training instances) respectively, resulting in 300 tables in total. To reduce variance caused by human raters, each example is evaluated by three different graduates who have passed an intermediate English test, and the scores are averaged. For factual correctness, raters count how many facts in the generated text are supported by the table, noted as #Sup, and how many are contradicting with or missing from the table, noted as #Cont. We report the average number of supporting facts (#Sup) and contradicting facts (#Cont) on each dataset in Table 4.

Example outputs:
Base+switch+LM(R): james beattie ( , born 10 july 1971 ) is an english former professional association footballer who played for , among others , England .
TableGPT: james beattie ( born 27 february 1978 in lancaster ) is a former english footballer who played as a striker .
The second evaluation criterion, language naturalness, evaluates the models from the aspects of grammaticality (is the sentence grammatically correct?) and fluency (is the sentence fluent and natural?). We arrange the texts from the different models for the same table into 3 pairs. For each pair of texts, shown without the table, raters are asked to decide which one is better or whether both are of the same quality, solely in terms of language naturalness. When a generated summary is chosen as the better one, we assign a score of 1.0 to it and 0.0 to the worse one. If the two summaries are deemed to be of the same quality, we assign a score of 0.5 to both. We then calculate the average scores and report the results on each dataset in Table 4.
Results show that TableGPT produces fewer contradicting facts than the previous state-of-the-art model Base+switch+LM(R) on Humans and Books and achieves comparable performance on Songs. Meanwhile, TableGPT includes more supporting facts on Humans and Songs and generates more natural text than Base+switch+LM(R) on Humans and Songs. Overall, our TableGPT model improves text fidelity while preserving the naturalness of the text.

Case Study
Compared with the previous state-of-the-art model Base+switch+LM(R) and the Reference, TableGPT performs better: it accurately describes most of the key information in fluent text. For Base+switch+LM(R), the design of a separate copy mechanism alongside GPT-2 may account for the inconsistent expression "played for , among others , England ." and the wrong birth date, which shows that an imperfect switch policy for deciding when to copy from the table can sometimes hurt the model's ability to generate high-fidelity text. In contrast, TableGPT, enhanced to generate high-fidelity text with two auxiliary tasks without breaking the unified GPT-2 model, performs better in terms of both fidelity and fluency in this example.

Related Work
In recent years, neural models that generate text directly from preprocessed data (Wiseman et al., 2017; Puduppully et al., 2019a; Puduppully et al., 2019b; Gong et al., 2019; Feng et al., 2020) have become mainstream for table-to-text generation and have achieved impressive performance with the help of large-scale datasets. Mei et al. (2016) propose a pre-selector on an encoder-aligner-decoder model for generation, which strengthens the model's content selection ability and obtains considerable improvement over a standard Seq2Seq model. Other work proposes a hybrid attention mechanism for modeling the order of content when generating text, and a field-gating encoder focusing on modeling table structure together with a dual attention mechanism to utilize the structure information when decoding. In addition, Bao et al. (2018) develop a table-aware sequence-to-sequence model for this task. However, Chen et al. (2020b) show that well-performing Seq2Seq models trained on large-scale datasets suffer from limited training data in the few-shot setting.
Recently, GPT-2 has been successfully adapted to dialogue generation in the few-shot setting (Zhang et al., 2020; Peng et al., 2020), showing its potential to address the problem of insufficient training data in few-shot learning with the help of knowledge learnt from pretraining on a vast corpus. For table-to-text generation, Chen et al. (2020b) propose a switch model that uses GPT-2 to generate template-like functional words while generating factual expressions by copying record values from the table in the few-shot scenario. Different from this work, we model the table and generate text within a single GPT-2 model in a unified way, and we show that our proposed TableGPT performs well in the few-shot scenario. In addition, different from both works mentioned above, we enhance GPT-2's ability to model table structure and improve text fidelity. Another closely related paper is Chen et al. (2019b), which predicts whether a statement aligns with the records in a table. Since the nature of the classification task makes it possible to model table records bidirectionally, it uses BERT with templates to transform and model the table. Meanwhile, Chen et al. (2020a) explore coarse-to-fine table-to-text generation with a standard GPT-2 model. Different from the above two works, we adapt GPT-2 to a text generation scenario with structured data input and, more importantly, address table structure modeling and text fidelity. In addition, one of our auxiliary tasks, content matching, is inspired by ideas from machine translation (Yang et al., 2019a) and Seq2Seq models. The closest paper on data-to-text generation is Wang et al. (2020), which assumes that the generated text should cover all the information in the table. But in a more realistic scenario, like the task we explore in this paper, the table contains redundant information, and only the important parts should be used to constrain the model to generate high-fidelity text.
Therefore, we propose to match important information only in the table and information in text as an auxiliary task during training.

Conclusion
In this work, we present TableGPT, which enhances GPT-2 for table-to-text generation with two auxiliary tasks, table structure reconstruction and content matching, improving text fidelity while exploiting GPT-2's linguistic knowledge learnt from pretraining on a large-scale corpus. In detail, we use table transformation to bridge the gap between the structured table and the natural language input GPT-2 expects, and further enhance GPT-2 with the two auxiliary tasks. The table structure reconstruction task helps the model preserve the structural information of the input while representing the table with the powerful pretrained GPT-2. In addition, the content matching task guides the model to generate high-fidelity text with fewer incorrect expressions that contradict the table, by measuring the distance between the table and the information in the generated text. Experiments are conducted on three datasets, Humans, Books and Songs, in different domains. Both automatic and human evaluation results show that our model achieves new state-of-the-art performance in most few-shot settings.