KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation

Data-to-text generation has recently attracted substantial interest due to its wide range of applications. Existing methods have shown impressive performance on an array of tasks. However, they rely on a significant amount of labeled data for each task, which is costly to acquire and thus limits their application to new tasks and domains. In this paper, we propose to leverage pre-training and transfer learning to address this issue. We propose a knowledge-grounded pre-training framework (KGPT), which consists of two parts: 1) a general knowledge-grounded generation model to generate knowledge-enriched text, and 2) a pre-training paradigm on a massive knowledge-grounded text corpus crawled from the web. The pre-trained model can be fine-tuned on various data-to-text generation tasks to generate task-specific text. We adopt three settings, namely fully-supervised, zero-shot, and few-shot, to evaluate its effectiveness. Under the fully-supervised setting, our model achieves remarkable gains over the known baselines. Under the zero-shot setting, our model achieves over 30 ROUGE-L on WebNLG without seeing any examples, while all other baselines fail. Under the few-shot setting, our model only needs about one-fifteenth as many labeled examples to achieve the same level of performance as baseline models. These experiments consistently demonstrate the strong generalization ability of our proposed framework: https://github.com/wenhuchen/KGPT.


Introduction
Data-to-text generation, i.e., generating textual descriptions from structured data, is an important task with many real-world applications such as generating weather reports (Liang et al., 2009) and sports news (Wiseman et al., 2017). Recently, neural generation models based on different strategies like soft-template (Wiseman et al., 2018; Ye et al., 2020), copy mechanism (See et al., 2017), content planning (Reed et al., 2018; Moryossef et al., 2019), and structure awareness (Colin and Gardent, 2019) have achieved impressive results. However, existing studies are primarily focused on the fully supervised setting, requiring substantial annotated data for each subtask, which restricts their adoption in real-world applications.
In this paper, we are interested in developing a general-purpose model that can easily adapt to different domains and tasks, and achieve strong performance with only a small amount of annotated examples, or even none. Our model draws inspiration from the recent wave of pre-trained language models (Devlin et al., 2019; Radford et al., 2019; Dai et al., 2019) and exploits large-scale unlabeled data from the web for pre-training. The data pairs are constructed through the following procedure. We first crawl sentences with hyperlinks from Wikipedia, and then link the hyperlinked entities to WikiData (Vrandečić and Krötzsch, 2014) to find their 1-hop knowledge triples. Finally, we build a subgraph based on the linked triples. Such automatic alignment between knowledge graph and text provides distant supervision (Mintz et al., 2009) for pre-training, but it is bound to be noisy. Therefore, we design a selection strategy and only retain plausible alignments with high semantic overlap. The harvested knowledge-grounded corpus KGTEXT consists of over 1.8M (knowledge subgraph, text) pairs, as depicted in Figure 1.
We unify the input of KGTEXT and downstream data-to-text tasks into a generalized format and design a novel architecture, KGPT, to encode it. We use KGTEXT to first pre-train KGPT and then fine-tune it on downstream data-to-text tasks like WebNLG (Shimorina and Gardent, 2018), E2ENLG (Dušek et al., 2019), and WikiBio (Lebret et al., 2016). Experimental results demonstrate several advantages of KGPT: 1) with the full downstream dataset, KGPT achieves remarkably better performance than known competitive baselines; 2) with zero training, KGPT can still achieve a reasonable score on WebNLG; 3) with a few training instances, KGPT maintains a high BLEU score while the non-pre-trained baselines only generate gibberish text. A quantitative study shows that our pre-training scheme can reduce annotation costs by roughly 15x to achieve a decent BLEU score of 30. Our contributions are summarized as follows: i) We design a distantly supervised learning algorithm to exploit large-scale unlabeled web text to pre-train data-to-text models.
ii) The proposed pre-training algorithm brings significant performance gains under different settings, especially the zero-shot and few-shot scenarios.

Related Work
Data-to-Text Generation Data-to-text generation is a long-standing problem (Kukich, 1983; Reiter and Dale, 1997), which involves generating natural language surface forms from structured data. Traditional systems are primarily built on template-based algorithms. Recently, with the development of deep learning, attention has gradually shifted to end-to-end neural generation models, which achieve strong performance on existing large-scale datasets like WebNLG (Shimorina and Gardent, 2018), E2ENLG (Dušek et al., 2019), WikiBio (Lebret et al., 2016), ROTOWIRE (Wiseman et al., 2017), TOTTO (Parikh et al., 2020), LogicNLG (Chen et al., 2020a), etc. However, these neural generation models mainly focus on fully supervised learning, requiring a huge amount of human annotation for each specific task. Our paper focuses on building a more generalized model architecture, which can adapt to specific tasks well with only a handful of training instances.
Knowledge-Grounded Language Modeling It is of primary importance to ground language models on existing knowledge of various forms. Neural language models (Bengio et al., 2003) have been shown to capture the co-occurrences of n-grams in sentences well, but fall short of maintaining faithfulness or consistency with world facts. To combat this issue, different knowledge-grounded language models (Ahn et al., 2016; Hayashi et al., 2020; Logan et al., 2019) have been proposed to infuse structured knowledge into the neural language model. These models mainly focus on enhancing the factualness of unconditional generative models. Inspired by these pioneering studies, we explore the possibility of connecting the unconditional generative model with downstream conditional generation tasks. The most straightforward knowledge-intensive conditional generation task is data-to-text generation, which aims to verbalize given knowledge in lexical form. We demonstrate the great potential of knowledge-grounded pre-training in enhancing the model's factualness on these downstream data-to-text tasks, and believe such language models can be applied to a broader range of NLP tasks requiring knowledge understanding.
Pre-trained Language Model Recently, the research community has witnessed the remarkable success of pre-training methods in a wide range of NLP tasks (Devlin et al., 2019; Radford et al., 2018, 2019; Dai et al., 2019; Liu et al., 2019b; Keskar et al., 2019; Lan et al., 2020; Lewis et al., 2019; Raffel et al., 2019). These models, trained on millions or billions of unlabeled examples, demonstrate unprecedented generalization ability on related downstream tasks. However, the existing pre-trained text generation models (Radford et al., 2019; Keskar et al., 2019; Raffel et al., 2019) are designed to condition on text input and thus lack the ability to encode structured inputs. The work closest to our concept is Switch-GPT-2 (Chen et al., 2020b), which uses the pre-trained GPT-2 model as the decoder to perform table-to-text generation. However, its knowledge encoder is still trained from scratch, which compromises the performance. In this paper, we follow the existing paradigm to construct an unlabeled web corpus for LM pre-training.

Dataset Construction
The construction process has two stages, namely the crawling stage and the selection stage:

Hyperlinked Sentence Crawling
We use the English Wikidump 2 as our data source. For each Wikipedia page, we split the paragraphs into an array of sentences and tokenize them with the nltk toolkit (Loper and Bird, 2002). We loop through each sentence and keep those with more than 2 Wikipedia anchor links and a length between 10 and 50 tokens. For each candidate sentence, we use its Wikipedia hyperlinks to query WikiData (Vrandečić and Krötzsch, 2014) and obtain the corresponding entity pages 3. We retrieve the neighboring knowledge triples from these entity pages to construct a local 1-hop graph for each entity. The knowledge triples are divided into two types: 1) the object of the triple is also an entity, like '(Roma F.C., country, Italy)'; 2) the object of the triple is plain text, like '(Roma F.C., inception, 7 June 1927)'. In the first case, if the object entity also appears in the sentence, we use it as a bridge to build a multi-hop graph like Figure 2. After this step, we collect roughly 4 million pairs in the form of (subgraph, sentence) as candidates for the following step.
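A minimal sketch of the crawling filter is given below, assuming the caller already has each sentence together with its Wikipedia anchor links. The helper name is hypothetical; only the criteria (more than 2 anchor links, 10-50 tokens) come from the description above.

```python
# Requires the nltk 'punkt' tokenizer data to be downloaded.
import nltk


def keep_sentence(sentence: str, anchor_links: list) -> bool:
    """Return True if the sentence qualifies as a KGTEXT candidate."""
    tokens = nltk.word_tokenize(sentence)  # tokenize with the nltk toolkit
    # Keep sentences with more than 2 Wikipedia anchor links and 10-50 tokens.
    return len(anchor_links) > 2 and 10 <= len(tokens) <= 50
```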

Data Selection
We observe that the collected pairs are overly noisy, with many sentences totally irrelevant to their paired subgraphs. Such pairs cannot serve our goal of building a knowledge-grounded language model. Therefore, we propose a data selection step to suppress the noise and retain only the data pairs of interest. An example is depicted in Figure 2: the first sentence does not rely on any information provided by the knowledge graph, while the second sentence has a tight connection to the facts presented in the knowledge graph. Ideally, our selection strategy should favor the second sentence over the first one.
To achieve this, we propose a simple lexical-based selection strategy. For example, the sentence 'He was born ...' in Figure 2 has two query words, 'Italy' and 'Germany', so we conduct two rounds of lexical matching. In the first round, we use 'Italy' to query its surrounding neighbors in WikiData and collect the neighboring unigrams, i.e., '(Rome, capital, Europe, Continent, Country, Roma F.C)'. We compute the unigram overlap with the original sentence '(He, was, ...)', which is 0%. In the second round, we use 'Germany' to do the same computation, and the lexical overlap is again 0%. So the final grounding score, averaged over the two rounds, is 0%. We follow the same procedure to compute the grounding score for the second sentence in Figure 2 with four rounds '(AS Rome, FB, Rome, Italy)'. Its grounding score is above 30%, which indicates that the sentence is highly grounded on the WikiData subgraph. In this paper, we use a threshold of 0.13, which selects the top 7M 'good' sentences from the original 12M Wikipedia corpus. After the selection step, we obtain a denoised knowledge-grounded corpus, KGTEXT, for pre-training. However, there still exist noisy false positives in the corpus; for example, a subgraph containing the triple '(Roma F.C., country, Italy)' may be associated with the text 'An Italian player plays for A.S. Roma'. Though the two entities co-occur, the sentence is not meant to describe the fact triple. By applying stricter rules we could suppress such false positives, but the data size would drop significantly as a consequence. We experimented with different thresholds to balance noise and data size and finally decided on a threshold with an acceptable noise level. The detailed statistics of KGTEXT are listed in Table 1. We hold out 10,000 sentences each for validation and testing to evaluate the pre-trained model.
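The grounding score can be sketched as follows, assuming for each hyperlinked entity we have the set of unigrams from its WikiData neighborhood: compute the unigram overlap between the sentence and each entity's neighborhood, then average over entities. The exact tokenization and normalization used to build KGTEXT may differ; the 0.13 threshold is the one reported above.

```python
def grounding_score(sentence_tokens, entity_neighbors):
    """sentence_tokens: list of words; entity_neighbors: {entity: set of neighborhood unigrams}."""
    sentence = {t.lower() for t in sentence_tokens}
    scores = []
    for entity, neighbors in entity_neighbors.items():
        neighbors = {w.lower() for w in neighbors}
        # Fraction of the entity's neighborhood unigrams that appear in the sentence.
        scores.append(len(sentence & neighbors) / max(len(neighbors), 1))
    return sum(scores) / max(len(scores), 1)


def keep_pair(sentence_tokens, entity_neighbors, threshold=0.13):
    # A (subgraph, sentence) pair is retained for KGTEXT if its score passes the threshold.
    return grounding_score(sentence_tokens, entity_neighbors) >= threshold
```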

Model
We formally define the problem setting and KGPT's architectures in this section.

Problem Setting
In this paper, we consider inputs from structured data with diverse formats, like the knowledge subgraphs in KGTEXT, the dialog acts in E2ENLG, and the tables in WikiBio. Here we unify them into a generalized dictionary format, which uses keys to represent subjects and values to denote the predicate-object pairs following each subject. We showcase the conversion from the structured inputs of different data-to-text datasets into our generalized format in Figure 3. The generalized input is denoted as X, and the output is denoted as y. Our model encodes X into a sequence of dense vectors, and the decoder then attends over them to generate y.
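An illustrative (hypothetical) instance of the generalized dictionary format is shown below: keys are subject entities, and values are the predicate-object pairs attached to each subject. The exact field layout in KGPT may differ; the entities are taken from earlier examples in this paper.

```python
example_input = {
    "Roma F.C.": [
        ("country", "Italy"),
        ("inception", "7 June 1927"),
    ],
    "Moses Malone": [
        ("gender", "male"),
    ],
}
```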

Encoder
The encoder network is crucial for our model to capture the highly structured graph input. We mainly experiment with two types of encoders. Graph Encoder This encoder is based on the graph attention network (Li et al., 2016; Kipf and Welling, 2017; Veličković et al., 2018) to explicitly encode the structure information. Specifically, we view each subject, predicate, and object as a leaf node, and add [ENT] and [TRIPLE] as pseudo nodes for message-passing purposes. The built graph is depicted in Figure 4. First of all, we initialize each node representation with the averaged embedding of its subword units. For example, the node 'Moses Malone' has the representation (E[Mos] + E[es] + E[Ma] + E[lone]) / 4, with E denoting the embedding. After we obtain the initial node representations, we use message propagation to update them based on neighboring information.
In the first layer, we exchange information between nodes inside a triple, e.g., 'Moses Malone' receives messages from its siblings 'Gender' and 'Male'. In the second layer, we aggregate information from the sub/pred/obj nodes to the [TRIPLE] node, e.g., '[TRIPLE1]' receives messages from its children 'Moses, Gender, Male'. In the third layer, we aggregate the information from the different [TRIPLE] nodes to the [ENT] node. In the fourth layer, we exchange information between different [ENT] nodes to enhance cross-entity interactions. Formally, we update the representation of the i-th node g_i ∈ R^D with a multi-head attention network that aggregates information from the neighboring nodes g_j ∈ N_i as follows:

$$\alpha^m_{ij} = \operatorname{softmax}_{j \in \mathcal{N}_i}\!\left(\frac{(W^m_Q g_i)^{\top}(W^m_K g_j)}{\sqrt{D}}\right), \qquad v^m = \sum_{j \in \mathcal{N}_i} \alpha^m_{ij}\, W^m_V g_j,$$

where m denotes the m-th head in the attention layer, W^m_Q, W^m_K, W^m_V ∈ R^{D×D} are the matrices that produce the query, key, and value vectors for the m-th head, and the attention output v combines the per-head outputs {v^m}. The attention output v and the residual connection from g_i are fed through the final MLP and LayerNorm to produce the updated node representation ĝ_i. The output of the graph encoder is denoted as G ∈ R^{n×D} = {g_1, ..., g_n} with n nodes.
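A minimal PyTorch sketch of one such message-passing layer is given below, assuming every node has at least one neighbor. The module name, masking convention, and layer stacking are illustrative rather than the exact KGPT implementation.

```python
import torch
import torch.nn as nn


class GraphAttentionLayer(nn.Module):
    """One update step: attend over neighbors, then MLP + LayerNorm with a residual."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, nodes, not_neighbor):
        # nodes: (1, n, D) node representations g_i for a single graph
        # not_neighbor: (n, n) bool mask, True where node j is NOT a neighbor of node i
        v, _ = self.attn(nodes, nodes, nodes, attn_mask=not_neighbor)
        return self.norm(self.mlp(v + nodes))  # g_hat_i = LayerNorm(MLP(v_i + g_i))
```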
Sequence Encoder This encoder is based on the Transformer (Vaswani et al., 2017) with special embeddings as auxiliary inputs to infuse the structure information into the sequence model. The concept of special embeddings was initially proposed by BERT (Devlin et al., 2019); more recently, it has been adopted by Herzig et al. (2020) to infuse structural information. We visualize the embedding layer in Figure 5, where we leverage additional entity embeddings, triple embeddings, and property embeddings to softly encode the structure of the subgraph as a linearized sequence, e.g., '<triple> Stuart_Parker_(footballer) | club | Chesterfield_F.C. <triple> 1_Decembrie_1918_University | nickname | Uab.'. The entity embedding informs the model which entity the current token belongs to, the triple embedding indicates which triple the current token belongs to, and the property embedding indicates whether the token is a subject, predicate, or object. Such an encoding mechanism is designed to softly encode the graph structure into the embedding space for further self-attention. Compared to the graph encoder, the sequence encoder does not enforce the structure as a hard constraint and allows more flexibility for the model to perform cross-triple and cross-entity interactions. Formally, the dot-product self-attention follows the definition of the Transformer (Vaswani et al., 2017):

$$f_{att}(Q^m, K^m, V^m) = \operatorname{softmax}\!\left(\frac{Q^m (K^m)^{\top}}{\sqrt{d_k}}\right) V^m,$$

where Q, K, V are computed from the input embeddings, m represents the m-th head with per-head dimension d_k, and f_att is the core attention function. The final output is denoted as G ∈ R^{n×D}, with n denoting the sequence length.
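The input embedding layer described above can be sketched as follows: the token embedding is summed with entity, triple, and property embeddings that softly mark which entity/triple each token belongs to and whether it is a subject, predicate, or object. Vocabulary sizes and index conventions are assumptions, not the repository's exact values.

```python
import torch.nn as nn


class StructuredEmbedding(nn.Module):
    def __init__(self, vocab_size=50257, dim=768,
                 max_entities=64, max_triples=128, num_properties=4):
        super().__init__()
        self.token = nn.Embedding(vocab_size, dim)
        self.ent = nn.Embedding(max_entities, dim)     # which [ENT] the token belongs to
        self.tri = nn.Embedding(max_triples, dim)      # which [TRIPLE] the token belongs to
        self.prop = nn.Embedding(num_properties, dim)  # subject / predicate / object / pad

    def forward(self, token_ids, entity_ids, triple_ids, property_ids):
        # Sum the four embeddings to softly encode the graph structure.
        return (self.token(token_ids) + self.ent(entity_ids)
                + self.tri(triple_ids) + self.prop(property_ids))
```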

Decoder
Our decoder architecture is based on the Transformer (Vaswani et al., 2017) and the copy mechanism (See et al., 2017). At each decoding time step, the model uses a copy gate p_gen to decide whether y_i should be generated from the vocabulary w ∈ V or copied from the input tokens x:

$$p(y_i) = p_{gen}\, p_{vocab}(y_i \mid o_i) + (1 - p_{gen}) \sum_{j:\, x_j = y_i} \alpha_j,$$

where o_i is the last-layer hidden state of the decoder at the i-th time step, from which the gate p_gen is computed, and α_j is the copy probability over the whole input token sequence x.
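A minimal sketch of this copy mechanism, in the spirit of See et al. (2017), is given below: a gate p_gen mixes the vocabulary distribution with the copy-attention distribution scattered onto the vocabulary ids of the input tokens. Module and tensor names are assumptions, not the exact KGPT code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CopyOutput(nn.Module):
    def __init__(self, dim=768, vocab_size=50257):
        super().__init__()
        self.vocab_proj = nn.Linear(dim, vocab_size)
        self.gate = nn.Linear(dim, 1)

    def forward(self, o_i, copy_attn, src_ids):
        # o_i: (batch, dim) decoder hidden state at step i
        # copy_attn: (batch, src_len) attention weights alpha_j over input tokens x
        # src_ids: (batch, src_len) vocabulary ids of the input tokens
        p_gen = torch.sigmoid(self.gate(o_i))              # (batch, 1)
        p_vocab = F.softmax(self.vocab_proj(o_i), dim=-1)  # (batch, vocab)
        # Scatter copy probability mass onto the vocabulary ids of the source tokens.
        p_copy = torch.zeros_like(p_vocab).scatter_add(1, src_ids, copy_attn)
        return p_gen * p_vocab + (1.0 - p_gen) * p_copy
```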

Optimization
Having defined our encoder-decoder model, we represent it as p_encdec(y_i | y_<i, X), which outputs a distribution over words y_i ∈ V at the i-th time step. During pre-training, we optimize the log-likelihood objective on D_KGTEXT. After pre-training, we convert the downstream task's input into the defined dictionary format, denote the dataset as D_down, and further optimize the log-likelihood objective with θ initialized from the pre-training stage. The pre-training and fine-tuning procedure is displayed in Figure 6, where we first use KGTEXT to pre-train KGPT, and then fine-tune it with different types of inputs using the standard auto-regressive log-likelihood objective.
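For concreteness, the objective used in both stages can be written as the standard auto-regressive maximum-likelihood sum below; the notation is ours and only restates the description above.

```latex
% Auto-regressive log-likelihood objective, maximized over D_KGTEXT during
% pre-training and over D_down during fine-tuning (notation assumed).
\mathcal{L}(\theta) = \sum_{(X,\, y) \in \mathcal{D}} \sum_{i=1}^{|y|}
    \log p_{\theta}\!\left(y_i \mid y_{<i},\, X\right)
```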

Experiments
We experiment with three different downstream tasks, which cover various data-to-text applications, to verify the generalization capability of KGPT. Besides fully supervised learning, we also evaluate zero-shot and few-shot learning.

Experimental Setup
We apply the standard GPT-2 (Radford et al., 2019) tokenizer from the Huggingface GitHub 5 to tokenize the text input, which has a vocabulary of over 50K subword units. We test both the graph encoder and the sequence encoder. We set their hidden size to 768 and stack 6 layers for both the encoder and the decoder, with 8 attention heads. During pre-training, we train the model on KGTEXT on 8 Titan RTX GPUs with a batch size of 512 for 15 epochs using the Adam (Kingma and Ba, 2015) optimizer with a learning rate of 1e-4. The pre-training procedure takes roughly 8 days to finish. We use a held-out validation set to select the best checkpoint. During fine-tuning, we use a learning rate of 2e-5. In the following experiments, we compare with the best known models on the different datasets. As none of these models are pre-trained, we also add Template-GPT-2 (Chen et al., 2020a) and Switch-GPT-2 (Chen et al., 2020b) as pre-trained baselines. Both models apply GPT-2 (Radford et al., 2019) as the generator to decode descriptions from a table. For ablation purposes, we also list the performance of non-pre-trained KGPT to measure the performance gain brought by pre-training alone. All the best models are selected based on the validation set score, and the numbers reported in the following tables are for the test split. For evaluation, we report the performance with BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and ROUGE-L (Lin, 2004) using e2e-metric 6. It is worth noting that we perform comprehensive data contamination studies in the following experiments to make sure the pre-training data has very little overlap with the test splits of the downstream tasks; we filter out potentially information-leaking pages during the data crawling process.
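For reference, the hyperparameters reported in this paragraph can be summarized as follows; the dictionary layout and field names are ours, not the repository's actual configuration format.

```python
# Hyperparameters as reported above, collected into an illustrative config.
kgpt_config = {
    "tokenizer": "GPT-2 (Huggingface), >50K subword units",
    "hidden_size": 768,
    "encoder_layers": 6,
    "decoder_layers": 6,
    "attention_heads": 8,
    "pretraining": {"optimizer": "Adam", "learning_rate": 1e-4, "batch_size": 512, "epochs": 15},
    "finetuning": {"learning_rate": 2e-5},
}
```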

Preliminary Study on KGTEXT
In the preliminary study, we evaluate our pre-trained model on the held-out set of KGTEXT to conduct an ablation study over KGPT. Specifically, we investigate 1) which encoding mechanism is better, and 2) whether we need the copy mechanism or copy supervision. As demonstrated in Table 3, we observe only a trivial difference between the two encoder designs. With the copy mechanism, KGPT can greatly decrease the perplexity. However, supervising the copy attention does not have much influence on the performance. Therefore, in the following experiments, we run both encoding schemes with the copy mechanism but without the copy loss.

Fully-Supervised Results
We experiment with KGPT under the standard fully-supervised setting to compare its performance with other state-of-the-art algorithms.
WebNLG Challenge We list WebNLG's experimental results in Table 4, where we compare with the known models under the unconstrained setting. The baseline models (Shimorina and Gardent, 2018) use a sequence-to-sequence attention model (Luong et al., 2015) as the backbone and propose delexicalization and a copy mechanism to enhance the model's capability to handle rare items from the input. The GCN model (Marcheggiani and Perez-Beltrachini, 2018) uses a graph convolutional encoder to encode the structured data input; its implementation is from GitHub 7. As can be seen, KGPT without pre-training already achieves better performance than the GCN baseline. With pre-training, the performance is further boosted by 1-2 BLEU-4, which reflects the effectiveness of our method.

E2E Challenge
We list E2ENLG's experimental results in Table 5, where we compare with the state-of-the-art systems on the leaderboard of the E2E challenge 8. These baseline methods are based on a neural template model (Wiseman et al., 2018), syntax-enhanced algorithms (Dušek and Jurcicek, 2016), slot alignment (Juraska et al., 2018), and a controlling mechanism (Elder et al., 2018). As seen from the table, KGPT beats the SOTA systems by a remarkable margin. Overall, the improvement brought by pre-training is roughly 0.5-1.0 BLEU-4, which is less significant than on WebNLG. Such a phenomenon is understandable given that this dataset contains limited patterns and vocabulary in the input meaning representation: a full training set of over 40K instances is more than enough for the generation model to memorize. In the following few-shot experiments, we will show the strength of KGPT in generating high-quality, faithful descriptions with only 0.1% of the training data. In the few-shot setting, we will also show a 25+ BLEU gain brought by pre-training.

Few-Shot Results
The few-shot learning setting aims to study the potential of the proposed pre-training to decrease annotation labor in data-to-text generation tasks. Under this setting, we not only compare with non-pre-trained baselines to observe how pre-training benefits the model's few-shot learning capability, but also compare with other pre-trained LMs (Chen et al., 2020b,a) to see the benefit of KGPT over existing pre-trained LMs. WebNLG & E2ENLG Dataset In these two datasets, we use 0.1%, 0.5%, 1%, 5%, and 10% of the training instances to train the model and observe its performance curve in terms of BLEU-4. For the WebNLG challenge, the few-shot situation introduces many unseen entities at test time. From Table 7, we observe that the delexicalization mechanism can remarkably help in the few-shot situation. However, the improvement brought by delexicalization is much weaker than that of our proposed pre-training. Under the 5% setting, while the non-pre-trained baselines are only able to generate gibberish text, pre-trained KGPT maintains a high BLEU score over 40.0 due to its strong generalization ability.
For the E2E challenge, the task is comparatively simpler with rather limited items. From Table 8, we observe that TGen (Dušek and Jurcicek, 2016) achieves similar performance to our non-pre-trained KGPT; both perform quite well even with 1% of the training instances. However, after we further reduce the training samples to roughly 0.1%, the baseline models fail while pre-trained KGPT still maintains a decent BLEU over 40.0.
WikiBio Dataset In this dataset, we adopt the same setting as Switch-GPT-2 (Chen et al., 2020b) and Pivot (Ma et al., 2019), using 50, 100, 200, and 500 samples from the training set to train the generation model. From the results in Table 9, we observe that KGPT achieves the best scores and outperforms both Template-GPT-2 and Switch-GPT-2 in most cases. Though Template-GPT-2 obtains a slightly better score with 500 training samples, its overall performance on the three datasets is remarkably lower than KGPT's, especially in the more extreme cases. This demonstrates the advantage of our knowledge-grounded pre-training objective over the naive LM pre-training objective. Quantitative Study We further investigate how much sample complexity KGPT can reduce. Specifically, we fix a target BLEU-4 score and vary the training data size to observe how many training samples are required to attain that performance. We set BLEU=30 as our standard and display our results in Table 10. We compute the ratio of sample quantities to characterize the benefit of pre-training. Roughly speaking, pre-training decreases the sample complexity by about 15x, which suggests a great reduction in annotation cost when using pre-trained KGPT to achieve the desired 'promising' performance.

Zero-Shot Results
We further evaluate KGPT's generalization capability under the extreme zero-shot setting and display our results for WebNLG in Table 11. As can be seen, all the non-pre-trained baselines and Template-GPT-2 fail under this setting, while KGPT still manages to generate reasonable outputs and achieves a ROUGE-L score over 30. Given that none of the input knowledge triples in WebNLG were seen during pre-training, these results reflect KGPT's strong generalization ability to cope with out-of-domain, unseen knowledge inputs.

Human Evaluation
We conduct a human evaluation to assess the factual accuracy of the generated sentences. Specifically, we sample 100 test samples from WebNLG and assess the models' factual consistency with the given fact triples. We use AMT to distribute each generated sentence to four high-quality workers (95% approval rate, 500+ approved jobs), who choose from three ratings; the majority-voted rating is taken as the final rating. We compare four different systems, i.e., the non-pre-trained and pre-trained versions of KGPT with both encoders. Conditioned on the fact triples, we categorize the generated samples into the following categories: 1) hallucinating non-existing facts, 2) missing given facts without hallucination, 3) accurately describing the given facts. We visualize the results in Figure 7, from which we observe that pre-trained KGPT is less prone to the known hallucination issue and generates more accurate text. The human evaluation suggests that pre-training can enhance the model's understanding of rare entities, thus reducing the over-generation of non-existent facts.

Conclusion
In this paper, we propose a pre-training recipe to exploit external unlabeled data for data-to-text generation tasks. Our proposed model achieves significant performance gains under the zero-shot and few-shot settings. Such a framework provides a plausible solution to greatly reduce human annotation costs in future NLG applications.

A Learning Curve
Here we observe the learning trend of both the non-pre-trained and pre-trained models by evaluating the validation BLEU at the end of each epoch; we show our findings in Figure 8. As can be seen from the figure, the pre-trained model converges much faster to the best score. More specifically, it takes only 20 epochs for the pre-trained model to reach a BLEU-4 over 60, while it takes 80-90 epochs for the non-pre-trained model to reach equivalent performance.

B Predicate Distribution
Here we show the most popular predicates in Figure 9. As can be seen, the most popular predicates are 'instance of', 'occupation', 'country', 'located in', etc. There are over 1000 predicates in our dataset, covering commonly seen categories in different domains like politics, athletics, music, news, etc.

C Case Study
Here we present an empirical study of generated samples from our models in Figure 10. As can be seen, KGPT has developed a strong generation capability, producing fluent and coherent sentences. In the first line, the decoded sentence is mostly correct, except that the name of the 'municipality' should be 'Belgrade' rather than 'Zemun' itself, according to https://www.wikidata.org/wiki/Q189419. In the second line, the sentence is mostly correct; the error comes from the end date of Annibale. The third sentence is completely correct. The fourth sentence also suffers from a factual error: the relationship should be 'married' rather than 'daughter'.
From these sentences, it is understandable that the model can achieve reasonable zero-shot performance on the WebNLG dataset, given that WebNLG comes from a similar domain. The case study also reveals that, although our generation model produces fluent and relevant sentences from the given knowledge triples, its groundedness is still questionable, with a fair amount of hallucination issues.

[Figure 10: Randomly generated samples from KGTEXT (reference and decoded sentence pairs), where the inputs are WikiData entities; each entity (e.g., 'Q403') and its fact triples can be looked up on WikiData.]