Key Fact as Pivot: A Two-Stage Model for Low Resource Table-to-Text Generation

Table-to-text generation aims to translate structured data into unstructured text. Most existing methods adopt the encoder-decoder framework to learn the transformation, which requires large-scale training samples. However, the lack of large parallel data is a major practical problem for many domains. In this work, we consider the scenario of low resource table-to-text generation, where only limited parallel data is available. We propose a novel model that separates the generation into two stages: key fact prediction and surface realization. It first predicts the key facts from the table, and then generates the text from the key facts. The training of key fact prediction needs much less annotated data, while surface realization can be trained with a pseudo parallel corpus. We evaluate our model on a biography generation dataset. Our model achieves a 27.34 BLEU score with only 1,000 parallel samples, while the baseline model only reaches 9.71 BLEU.


Introduction
Table-to-text generation aims to generate a description from a structured table. It helps readers summarize the key points in the table and convey them in natural language. Figure 1 shows an example of table-to-text generation. The table provides structured information about a person named "Denise Margaret Scott", and the corresponding text describes the person with the key information in the table. Table-to-text generation can be applied in many scenarios, including weather report generation (Liang et al., 2009), NBA news writing (Barzilay and Lapata, 2005), biography generation (Duboué and McKeown, 2002; Lebret et al., 2016), and so on. Moreover, table-to-text generation is a good testbed of a model's ability to understand structured knowledge.
Most of the existing methods for table-to-text generation are based on the encoder-decoder framework (Bahdanau et al., 2014). They represent the source tables with a neural encoder, and generate the text word by word with a decoder conditioned on the source table representation. Although the encoder-decoder framework has proven successful in the area of natural language generation (NLG) (Luong et al., 2015; Chopra et al., 2016; Lu et al., 2017), it requires a large parallel corpus, and is known to fail when the corpus is not big enough. Figure 2 shows the performance of a table-to-text model trained with different numbers of parallel samples under the encoder-decoder framework. We can see that the performance is poor when the parallel data size is small. In practice, large parallel data is unavailable in many domains, and it is expensive to construct a high-quality parallel corpus. This work focuses on the task of low resource table-to-text generation, where only limited parallel data is available. Some previous work (Puduppully et al., 2018; Gehrmann et al., 2018) formulates the task as the combination of content selection and surface realization, and models them with an end-to-end model. Inspired by this work, we break up table-to-text generation into two stages, each of which is performed by a model trainable with only a few annotated data. Specifically, our method first predicts the key facts from the table, and then generates the text from the key facts, as shown in Figure 1. The two-stage method consists of two separate models: a key fact prediction model and a surface realization model. The key fact prediction model is formulated as a sequence labeling problem, so it needs much less annotated data than the encoder-decoder models. According to our experiments, the model obtains an 87.92% F1 score with only 1,000 annotated samples. As for the surface realization model, we propose a method to construct a pseudo parallel dataset without the need of labeled data.
In this way, our model can make full use of the unlabeled text, and alleviates the heavy dependence on parallel data. The contributions of this work are as follows: • We propose to break up table-to-text generation into two stages with two separate models, so that each model can be trained with only a few annotated data.
• We propose a method to construct a pseudo parallel dataset for the surface realization model, without the need of labeled data.
• Experiments show that our proposed model can achieve a 27.34 BLEU score on a biography generation dataset with only 1,000 table-text samples.

PIVOT: A Two-Stage Model
In this section, we introduce our proposed two-stage model, which we denote as PIVOT. We first give the formulation of table-to-text generation and the related notations. Then, we provide an overview of the model. Finally, we describe the two models for each stage in detail.

Formulation and Notations
Suppose we have a parallel table-to-text dataset P with N samples and an unlabeled text dataset U with M samples. Each parallel sample consists of a source table T and a text description y = {y_1, y_2, ..., y_n}. The table T can be formulated as K records T = {r_1, r_2, ..., r_K}, where each record is an attribute-value pair r_j = (a_j, v_j). Each sample in the unlabeled text dataset U is a piece of text ȳ = {ȳ_1, ȳ_2, ..., ȳ_n}. Formally, the task of table-to-text generation is to take the structured table T = {(a_1, v_1), (a_2, v_2), ..., (a_m, v_m)} as input, and output the word sequence y = {y_1, y_2, ..., y_n}.

Figure 3 shows the overall architecture of our proposed model. Our model contains two stages: key fact prediction and surface realization. At the first stage, we represent the table as a sequence, and use a table-to-pivot model to select the key facts from the sequence. The table-to-pivot model adopts a bi-directional Long Short-Term Memory network (Bi-LSTM) to predict a binary sequence indicating whether each word is reserved as a key fact. At the second stage, we build a sequence-to-sequence model that takes the key facts selected in the first stage as input and emits the table description. In order to make use of the unlabeled text corpus, we propose a method to construct pseudo parallel data to train a better surface realization model. Moreover, we introduce a denoising data augmentation method to reduce the risk of error propagation between the two stages.
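Concretely, the record view of a table and the flattened sequence view used by the later models can be sketched as follows (the attribute names and values are illustrative stand-ins, not drawn from the dataset):

```python
# A toy table represented as attribute-value records T = {(a_j, v_j)}.
table = [
    ("name", "denise margaret scott"),
    ("birth_date", "24 april 1955"),
    ("occupation", "comedian"),
]

# Flatten multi-word values so each token carries its attribute,
# matching the sequence view used by the key fact prediction model.
value_seq, attr_seq = [], []
for attr, value in table:
    for tok in value.split():
        value_seq.append(tok)
        attr_seq.append(attr)
```

This yields the paired value sequence {v_1, ..., v_m} and attribute sequence {a_1, ..., a_m} described below.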

Preprocessing: Key Fact Selection
The two stages are trained separately, but the dataset does not label which words in the table are the key facts. In this work, we define the co-occurring facts between the table and the text as the key facts, so we can label the key facts automatically. Algorithm 1 illustrates the process of automatically annotating the key facts. Given a table and its associated text, we enumerate each attribute-value pair in the table, and compute the word overlap between the value and the text. The word overlap is defined as the number of words that are not stop words or punctuation but appear in both the table and the text. We collect all values that have at least one overlapping word with the text, and regard them as the key facts. In this way, we obtain a binary sequence whose 0/1 labels denote whether each value in the table is a key fact. The binary sequence serves as the supervised signal of the key fact prediction model, and the selected key facts are the input of the surface realization model.
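The annotation procedure can be sketched in a few lines of Python (the stop-word list and the example table are toy stand-ins, not the paper's actual resources):

```python
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "was"}  # toy stop list

def annotate_key_facts(table, text):
    """Label each record 1 if its value shares a content word with the text.

    `table` is a list of (attribute, value) pairs; `text` a token list.
    Mirrors Algorithm 1: a record is a key fact when its value overlaps
    the description in at least one non-stop-word token.
    """
    text_tokens = set(text)
    labels = []
    for attr, value in table:
        overlap = [t for t in value.split()
                   if t in text_tokens and t not in STOP_WORDS]
        labels.append(1 if overlap else 0)
    return labels

table = [("name", "denise margaret scott"),
         ("birth_date", "24 april 1955"),
         ("nationality", "australian")]
text = "denise scott is an australian comedian".split()
labels = annotate_key_facts(table, text)  # → [1, 0, 1]
```

The resulting 0/1 labels are exactly the supervised signal described above.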

Stage 1: Key Fact Prediction
The key fact prediction model is a Bi-LSTM with a multi-layer perceptron (MLP) classifier that determines whether each word is selected. In order to represent the table, we follow previous work and concatenate all the words in the values of the table into a word sequence, labeling each word with its attribute. In this way, the table is represented as two sequences: the value sequence {v_1, v_2, ..., v_m} and the attribute sequence {a_1, a_2, ..., a_m}. A word embedding and an attribute embedding are used to transform

Algorithm 1 Automatic Key Fact Annotation

Input: A parallel corpus P = {(x_i, y_i)}, where x_i is a table and y_i is a word sequence.
1: Initialize the selected key fact list W = []
2: for each sample (x, y) in the parallel dataset P do
3:   Initialize the selected attribute set A = {}
4:   Initialize the selected key fact list W_i = []
5:   for each attribute-value pair (a_i, v_i) in table x do
6:     if v_i in y and v_i is not a stop word then
7:       Append attribute a_i to attribute set A
8:     end if
9:     if a_i in A then
10:      Append value v_i to key fact list W_i
11:     end if
12:   end for
13:   Collect the key fact list: W += W_i
14: end for
Output: The selected key fact list W

the two sequences into vectors. Following previous work (Lebret et al., 2016), we introduce a position embedding to capture the structured information of the table. The position information is represented as a tuple (p_w^+, p_w^-), which contains the positions of the token w counted from the beginning and from the end of the value, respectively. For example, the record "(Name, Denise Margaret Scott)" is represented as "({Denise, Name, 1, 3}, {Margaret, Name, 2, 2}, {Scott, Name, 3, 1})". In this way, each token in the table has a unique feature embedding even if the same word appears twice. Finally, the word embedding, the attribute embedding, and the position embedding are concatenated as the model input x.
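The position features can be computed with a short helper (a sketch; the function name is ours, not the authors'):

```python
def position_features(table):
    """Compute (p+_w, p-_w) for each value token: its position counted
    from the beginning and from the end of its value, as described above.

    `table` is a list of (attribute, value) pairs.
    """
    feats = []
    for attr, value in table:
        toks = value.split()
        n = len(toks)
        for i, tok in enumerate(toks, start=1):
            # i counts from the start of the value; n - i + 1 from the end.
            feats.append((tok, attr, i, n - i + 1))
    return feats

feats = position_features([("name", "denise margaret scott")])
# → [("denise", "name", 1, 3), ("margaret", "name", 2, 2), ("scott", "name", 3, 1)]
```

This reproduces the "(Denise, Name, 1, 3)"-style tuples from the running example.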
Bi-LSTM: The Bi-LSTM reads the input sequence in both directions and concatenates the hidden states:

h⃗_t = LSTM(x_t, h⃗_{t-1}),   h⃖_t = LSTM(x_t, h⃖_{t+1}),   h_t = [h⃗_t; h⃖_t]

where h⃗_t and h⃖_t are the forward and backward hidden outputs respectively, h_t is the concatenation of h⃗_t and h⃖_t, and x_t is the input at the t-th time step.

Classifier: The output vector h_t is fed into an MLP classifier to compute the probability distribution of the label:

p_1(l_t | x) = softmax(W_c h_t + b_c)

where W_c and b_c are trainable parameters of the classifier.

Stage 2: Surface Realization
The surface realization stage aims to generate the text conditioned on the key facts predicted in Stage 1. We adopt two models as the implementation of surface realization: the vanilla Seq2Seq and the Transformer (Vaswani et al., 2017).
Vanilla Seq2Seq: In our implementation, the vanilla Seq2Seq consists of a Bi-LSTM encoder and an LSTM decoder with the attention mechanism. The Bi-LSTM encoder is the same as that of the key fact prediction model, except that it does not use any attribute embedding or position embedding.
The decoder consists of an LSTM, an attention component, and a word generator. It first generates the hidden state s_t:

s_t = f(s_{t-1}, y_{t-1})

where f(·, ·) is the LSTM function for one time step, and y_{t-1} is the word generated at time step t-1. Then, the hidden state s_t is fed into the attention component:

v_t = Attention(s_t, h)

where Attention(·, ·) is the global attention of Luong et al. (2015), and h is the sequence of encoder outputs. Given the output vector v_t from the attention component, the word generator computes the probability distribution over output words at time step t:

p_2(y_t | y_{<t}, x) = softmax(W_g v_t + b_g)

where W_g and b_g are parameters of the generator. The word with the highest probability is emitted as the t-th word.
Transformer: Similar to the vanilla Seq2Seq, the Transformer consists of an encoder and a decoder. The encoder applies a Transformer layer to encode each word into the representation h_t:

h_t = Transformer(x_t, x)

Inside the Transformer layer, the representation x_t attends to the collection of all input representations x = {x_1, x_2, ..., x_m}. Then, the decoder produces the hidden state by attending to both the encoder outputs and the previous decoder outputs:

v_t = Transformer(y_t, y_{<t}, h)

Finally, the output vector v_t is fed into a word generator with a softmax layer, in the same way as Eq. 5. For simplicity, we omit the details of the inner computation of the Transformer layer, and refer the reader to the related work (Vaswani et al., 2017).

Pseudo Parallel Data Construction
The surface realization model is based on the encoder-decoder framework, which requires a large amount of training data. In order to augment the training data, we propose a novel method to construct pseudo parallel data. The surface realization model is used to organize and complete the text given the key facts. Therefore, it is possible to construct pseudo parallel data by removing the skeleton of the text and reserving only the key facts. In our implementation, we label the text with the Stanford CoreNLP toolkit to assign a POS tag to each word. We reserve the words whose POS tags are in the tag set {NN, NNS, NNP, NNPS, JJ, JJR, JJS, CD, FW}, and remove the remaining words. In this way, we can construct large-scale pseudo parallel data to train the surface realization model.
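A minimal sketch of this filtering step, assuming the POS tags are already available (here the tags are typed by hand for illustration rather than produced by CoreNLP):

```python
# The content-word tag set described above.
CONTENT_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ", "JJR", "JJS", "CD", "FW"}

def make_pseudo_source(tagged_tokens):
    """Build a pseudo 'key fact' input by keeping only content words.

    `tagged_tokens` is a list of (word, pos) pairs, e.g. from a POS tagger.
    The pair (pseudo source, original sentence) then serves as a pseudo
    parallel training example for the surface realization model.
    """
    return [w for w, tag in tagged_tokens if tag in CONTENT_TAGS]

tagged = [("denise", "NNP"), ("scott", "NNP"), ("is", "VBZ"),
          ("an", "DT"), ("australian", "JJ"), ("comedian", "NN")]
pseudo_source = make_pseudo_source(tagged)
# → ["denise", "scott", "australian", "comedian"]
```

Each unlabeled sentence thus yields one (pseudo source, sentence) pair at no annotation cost.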

Denoising Data Augmentation
A problem with the two-stage model is that errors may propagate from the first stage to the second. A possible solution is to apply beam search to enlarge the search space at the first stage. However, in our preliminary experiments, when the beam size is small, the diversity of the predicted key facts is low, and the accuracy does not improve. When the beam size is large, decoding is slow but the improvement in accuracy is limited.
To address this issue, we implement a denoising data augmentation method to reduce the harm of error propagation and improve the robustness of our model. In practice, we randomly drop some words from the input of the surface realization model, or insert some words from other samples. Dropping simulates the cases where the key fact prediction model fails to recall some co-occurring facts, while inserting simulates the cases where the model predicts some extra facts from the table. By adding this noise, we can regard these data as adversarial examples, which improves the robustness of the surface realization model.
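A possible implementation of this noise injection (the drop and insert probabilities here are illustrative defaults, not the paper's tuned values):

```python
import random

def add_noise(tokens, other_tokens, drop_p=0.1, insert_p=0.1, rng=None):
    """Randomly drop tokens and insert tokens drawn from another sample.

    Dropping simulates key facts missed at Stage 1; inserting simulates
    spurious extra facts predicted from the table.
    """
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    noisy = []
    for tok in tokens:
        if rng.random() < drop_p:
            continue  # simulate a missed fact
        noisy.append(tok)
        if other_tokens and rng.random() < insert_p:
            noisy.append(rng.choice(other_tokens))  # simulate an extra fact
    return noisy
```

Applied on the fly during training, each epoch sees a differently corrupted version of the key fact input.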

Training and Decoding
Since the two components of our model are separate, the objective functions of the models are optimized individually.

Training of Key Fact Prediction Model:
The key fact prediction model, as a sequence labeling model, is trained using the cross entropy loss:

L_1 = -Σ_{t=1}^{m} log p_1(l_t | x)

Training of Surface Realization Model: The loss function of the surface realization model can be written as:

L_2 = -Σ_{t=1}^{n} log p_2(y_t | y_{<t}, x̂)

where x̂ is the sequence of key facts selected at Stage 1. The surface realization model is also trained with the pseudo parallel data described in Section 2.6. The objective function can be written as:

L_3 = -Σ_{t=1}^{n} log p_2(ȳ_t | ȳ_{<t}, x̃)

where ȳ is the unlabeled text, and x̃ is the pseudo input paired with ȳ.
Decoding: The decoding consists of two steps. At the first step, the model predicts the label of each table token with the key fact prediction model:

l̂_t = argmax_{l ∈ {0,1}} p_1(l_t = l | x)

The words with l̂_t = 1 are reserved, while those with l̂_t = 0 are discarded, yielding a sub-sequence x̂. At the second step, the model emits the text with the surface realization model:

ŷ_t = argmax_{w ∈ V} p_2(y_t = w | ŷ_{<t}, x̂)

where V is the vocabulary of the model. The word sequence {ŷ_1, ŷ_2, ..., ŷ_N} forms the generated text.
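The two-step decoding pipeline can be sketched as follows, with hypothetical stand-ins for the two trained models (the callables and names are ours, not part of the paper's code):

```python
def two_stage_decode(value_seq, predict_labels, realize):
    """Glue code for the two-stage pipeline.

    `predict_labels` stands in for the trained Stage 1 model (returns a
    0/1 label per token); `realize` for the trained Stage 2 model
    (returns text from the selected key facts).
    """
    labels = predict_labels(value_seq)                       # Stage 1
    key_facts = [w for w, l in zip(value_seq, labels) if l == 1]
    return realize(key_facts)                                # Stage 2

# Toy stand-ins that only show the data flow:
toy_labels = lambda seq: [1 if w != "1955" else 0 for w in seq]
toy_realize = lambda facts: " ".join(facts) + " ..."
out = two_stage_decode(["denise", "scott", "1955"], toy_labels, toy_realize)
# → "denise scott ..."
```

The intermediate key fact list is exactly the sub-sequence x̂ passed between the stages.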

Experiments
We evaluate our model on a table-to-text generation benchmark. We denote the PIVOT model under the vanilla Seq2Seq framework as PIVOT-Vanilla, and that under the Transformer framework as PIVOT-Trans.

Dataset
We use the WIKIBIO dataset (Lebret et al., 2016) as our benchmark. The dataset contains 728,321 articles from English Wikipedia, and uses the first sentence of each article as the description of the corresponding infobox. Each description contains 26.1 words on average, of which 9.5 words also appear in the table. The table contains 53.1 words and 19.7 attributes on average. Following previous work (Lebret et al., 2016), we split the dataset into 80% training set, 10% testing set, and 10% validation set. In order to simulate the low resource scenario, we randomly sample 1,000 parallel samples, and remove the tables from the rest of the training data.

Implementation Details
The vocabulary is limited to the 20,000 most common words in the training dataset. The batch size is 64 for all models. We implement early stopping with a patience of 4 epochs: training stops when the performance on the validation set does not improve for 4 consecutive epochs. We tune the hyper-parameters based on the performance on the validation set.
The key fact prediction model is a Bi-LSTM. The dimensions of the hidden units, the word embedding, the attribute embedding, and the position embedding are 500, 400, 50, and 5, respectively.
We implement two models as the surface realization models. For the vanilla Seq2Seq model, we set the hidden dimension, the embedding dimension, and the dropout rate (Srivastava et al., 2014) to be 500, 400, and 0.2, respectively. For the Transfomer model, the hidden units of the multihead component and the feed-forward layer are 512 and 2048. The embedding size is 512, the number of heads is 8, and the number of Transformer blocks is 6.
We use the Adam optimizer (Kingma and Ba, 2014) to train the models. For the hyper-parameters of the Adam optimizer, we set the learning rate α = 0.001, the two momentum parameters β_1 = 0.9 and β_2 = 0.999, and ε = 1 × 10^{-8}. We clip the gradients (Pascanu et al., 2013) to a maximum norm of 5.0. We halve the learning rate when the performance on the validation set does not improve for 3 epochs.
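The learning-rate schedule described above can be sketched as follows (`maybe_halve_lr` is our own helper name, and we assume a validation score where higher is better, such as BLEU):

```python
def maybe_halve_lr(lr, history, patience=3):
    """Halve the learning rate when the validation score has not improved
    for `patience` consecutive epochs.

    `history` is the list of per-epoch validation scores so far
    (higher is better).
    """
    if len(history) > patience and \
            max(history[-patience:]) <= max(history[:-patience]):
        return lr / 2.0
    return lr
```

Called once per epoch after evaluation, this reproduces the "halve on 3 stagnant epochs" rule; the early-stopping check with patience 4 can be implemented the same way.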

Baselines
We compare our models with two categories of baseline models: supervised models which exploit only parallel data (Vanilla Seq2Seq, Transformer, Struct-aware), and semi-supervised models which are trained on both parallel data and unlabeled data (PretrainedMT, SemiMT). The baselines are as follows: • Vanilla Seq2Seq with the attention mechanism (Bahdanau et al., 2014) is a popular model for natural language generation.
• Transformer (Vaswani et al., 2017) is a state-of-the-art model under the encoder-decoder framework, based solely on attention mechanisms.
• Struct-aware is the state-of-the-art model for table-to-text generation. It models the inner structure of the table with a field-gating mechanism inside the LSTM, and learns the interaction between tables and text with a dual attention mechanism.

Table 1: Results of our model and the baselines. The top block reports the performance of the key fact prediction component (F1: F1 score, P: precision, R: recall); the middle block compares models under the vanilla Seq2Seq framework; the bottom block compares models implemented with the Transformer framework.
• PretrainedMT (Skorokhodov et al., 2018) is a semi-supervised method to pretrain the decoder of the sequence-to-sequence model with a language model.
• SemiMT (Cheng et al., 2016) is a semi-supervised method to jointly train the sequence-to-sequence model with an auto-encoder.
The supervised models are trained with the same parallel data as our model, while the semi-supervised models use the same parallel data and unlabeled data as our model.

Results
We compare our PIVOT model with the above baseline models. Table 1 summarizes the results. Our PIVOT model achieves an 87.92% F1 score, 92.59% precision, and 83.70% recall at the stage of key fact prediction, which provides a good foundation for the stage of surface realization. Based on the selected key facts, our models achieve 20.09 BLEU, 6.5130 NIST, and 18.31 ROUGE under the vanilla Seq2Seq framework, and 27.34 BLEU, 6.8763 NIST, and 19.30 ROUGE under the Transformer framework, significantly outperforming all the baseline models on all metrics. Furthermore, the implementation with the Transformer obtains higher scores than that with the vanilla Seq2Seq.

Varying Parallel Data Size
We would like to further analyze the performance of our model given different sizes of parallel data. Therefore, we randomly shuffle the full parallel training set, extract the first K samples as the parallel data, and turn the remaining data into unlabeled data by removing the tables. We set K = 1000, 6000, 30000, 60000, 300000, and compare our pivot models with both the vanilla Seq2Seq and the Transformer. Figure 4 shows the BLEU scores of our models and the baselines. When the parallel data size is small, the pivot models outperform the vanilla Seq2Seq and the Transformer by a large margin. As the parallel data grows, the margin narrows because of the upper bound of the model capacity. Figure 5 shows the F1 score of the key fact prediction model trained with different parallel data sizes. Even when the number of annotated samples is extremely small, the model obtains a satisfying F1 score of about 88%. In general, the F1 scores at low and high parallel data sizes are close, which validates the assumption that the key fact prediction model does not rely on heavy annotation.

Effect of Pseudo Parallel Data
In order to analyze the effect of the pseudo parallel data, we conduct an ablation study by adding the data to the baseline models and removing it from our models. Table 2 summarizes the results of the ablation study. Surprisingly, the pseudo parallel data not only helps the pivot model, but also significantly improves the vanilla Seq2Seq and the Transformer. The reason is that the pseudo parallel data helps the models improve their ability of surface realization, which they lack under the condition of limited parallel data. The pivot models can still outperform the baselines with pseudo data, mainly because the pivot approach breaks up the operations of key fact prediction and surface realization, both of which are explicitly and separately optimized.

Example outputs (from Table 4):
Transformer: a athletics -lrb- nfl -rrb- .
SemiMT: gustav dovid -lrb- born 25 august 1945 -rrb- is a former hungarian politician , who served as a member of the united states -lrb- senate -rrb- from president to 1989 .
PIVOT-Trans: philippe adnot -lrb- born august 25 , 1945 -rrb- is a french senator , senator , and a senator of the french senate .
Reference: philippe adnot -lrb- born 25 august 1945 in rhges -rrb- is a member of the senate of france .

Effect of Denoising Data Augmentation
We also want to know the effect of the denoising data augmentation. Therefore, we remove the denoising data augmentation from our model, and compare with the full model. Table 3 shows the results of the ablation study. The data augmentation brings a significant improvement to the pivot models under both the vanilla Seq2Seq and the Transformer frameworks, which demonstrates the effectiveness of the denoising data augmentation.

Qualitative Analysis
We provide an example to illustrate the improvement of our model more intuitively, as shown in Table 4. Under the low resource setting, the Transformer cannot produce a fluent sentence, and also fails to select the proper facts from the table. Thanks to the unlabeled data, the SemiMT model can generate a fluent, human-like description. However, it suffers from the hallucination problem, generating some unseen facts that are not faithful to the source input. Although the PIVOT model has some problems with repeated words (such as "senator" in the example), it selects the correct key facts from the table, and produces a fluent description.

Related Work
This work is mostly related to both table-to-text generation and low resource natural language generation.
Table-to-text Generation

Table-to-text generation is widely applied in many domains. Duboué and McKeown (2002) proposed to generate biographies by matching the text with a knowledge base. Barzilay and Lapata (2005) presented an efficient method for automatically learning content selection rules from a corpus and its related database in the sports domain. Liang et al. (2009) introduced a system with a sequence of local decisions for sportscasting and weather forecasts. Recently, thanks to the success of neural network models, more work has focused on neural generative models in an end-to-end style (Wiseman et al., 2017; Puduppully et al., 2018; Gehrmann et al., 2018; Bao et al., 2018; Qin et al., 2018). Lebret et al. (2016) constructed a dataset of biographies from Wikipedia, and built a neural model based on conditional neural language models. Follow-up work introduced a structure-aware sequence-to-sequence architecture to model the inner structure of the tables and the interaction between the tables and the text. Wiseman et al. (2018) focused on an interpretable and controllable generation process, and proposed a neural model using a hidden semi-Markov model decoder to address these issues. Nie et al. (2018) attempted to improve the fidelity of neural table-to-text generation by utilizing pre-executed symbolic operations in a sequence-to-sequence model.

Low Resource Natural Language Generation
The topic of low resource learning is one of the recent spotlights in the area of natural language generation (Tilk and Alumäe, 2017; Tran and Nguyen, 2018). Most work has focused on neural machine translation, whose models can generalize to other tasks in natural language generation. Gu et al. (2018) proposed a universal machine translation model which uses a transfer-learning approach to share lexical and sentence-level representations across different languages. Cheng et al. (2016) proposed a semi-supervised approach that jointly trains the sequence-to-sequence model with an auto-encoder which reconstructs the monolingual corpora. More recently, some work has explored unsupervised methods to totally remove the need for parallel data (Lample et al., 2018b,a; Artetxe et al., 2017; Zhang et al., 2018).

Conclusions
In this work, we focus on low resource table-to-text generation, where only limited parallel data is available. We separate the generation into two stages, each of which is performed by a model trainable with only a few annotated data. In addition, we propose a method to construct a pseudo parallel dataset for the surface realization model without the need of any structured table. Experiments show that our proposed model achieves a 27.34 BLEU score on a biography generation dataset with only 1,000 parallel samples.