Large Scale Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training

Generating natural sentences from Knowledge Graph (KG) triples, known as Data-To-Text Generation, is a task with many datasets for which numerous complex systems have been developed. However, no prior work has attempted to perform this generation at scale by converting an entire KG into natural text. In this paper, we verbalize the entire English Wikidata KG and, in the process, create a KG-Text aligned corpus for training. We discuss the challenges in verbalizing an entire KG versus verbalizing smaller datasets. We further show that verbalizing an entire KG can be used to integrate structured and natural language data. In contrast to the many architectures that have been developed to bridge the structural differences between these two sources, our approach converts the KG into the same format as natural text, allowing it to be seamlessly plugged into existing natural language systems. We evaluate this approach by augmenting the retrieval corpus in REALM and showing improvements on both the LAMA knowledge probe and open domain QA.


Introduction
Data-To-Text Generation (Kukich, 1983; Goldberg et al., 1994) involves converting knowledge graph (KG) triples of the form (subject, relation, object) into natural language sentences. There are many standard datasets for this task, such as WebNLG (Gardent et al., 2017), and many systems have been developed to improve performance on them. However, to the best of our knowledge, no prior work has attempted to do this at scale, i.e. verbalize a full knowledge graph. In this paper, we convert the English Wikidata KG (Vrandečić and Krötzsch, 2014) into natural language text.

Figure 1: An example of generating text from a KG. The triples on the left were grouped by our system and converted to the sentence on the right.
Our generated corpus consists of ∼18M sentences spanning ∼45M triples with ∼1500 distinct relations. We discuss the challenges associated with this verbalization in comparison to existing Data-to-Text datasets (e.g. WebNLG). These challenges include entity and relation coverage and the lack of grouped sets of triples that can produce coherent sentences together. We create an English Wikidata KG-Wikipedia Text aligned corpus, covering a variety of entities such as dates and numerical quantities, to train our system. We release both the aligned corpus and the generated verbalized KG. We call the generated corpus the KELM Corpus (Corpus for Knowledge-Enhanced Language Model Pre-training).
We evaluate the quality of the generated corpus through a human evaluation on a random sample. We further showcase the utility of this corpus in language model pre-training. Text covers only a limited portion of world knowledge, so language models trained on text alone are restricted to facts that happen to be expressed in natural language. Moreover, facts may not be expressed as explicitly in text as they are in KGs, and the variability in the quality of text can introduce biases into the resulting models (Bolukbasi et al., 2016; Sheng et al., 2019; Manzini et al., 2019). Building models that handle structured data and free-form text seamlessly has been a long sought-after goal, but their integration is challenging due to the differing structural formats. KG verbalization provides a simple way to integrate KGs with natural text. We illustrate this by augmenting the REALM (Guu et al., 2020) retrieval corpus with the KELM Corpus, evaluating the augmented system on the LAMA knowledge probe (Petroni et al., 2019) and open domain QA, and showing improvements on both.
Through ablation experiments where we augment the retrieval corpus with the raw triples instead, we further confirm the effectiveness of verbalization. Finally, we provide an extensive discussion of other potential applications of this corpus, such as developing factually consistent models and reducing offensive content generation. In summary, our contributions are the verbalization of the full Wikidata KG, the release of the resulting KELM Corpus together with the KG-Text aligned training corpus, and improvements on the LAMA knowledge probe and open domain QA from integrating the corpus into REALM.


Related Work

Data-To-Text Generation
There are currently several datasets of varying complexity and slightly different objectives for the Data-To-Text Generation task, also known as Table-To-Text, Database-To-Text, Graph-To-Text and Concept-To-Text. WebNLG 2017 and 2020 (Gardent et al., 2017) involve generating natural language text given a set of one to seven triples, whereas E2ENLG (Dušek et al., 2018) takes either a picture or a set of (object, value) pairs as input. WikiBio (Lebret et al., 2016) involves generating the biography of a Wikipedia entity given its infobox. Other datasets utilize tables: Wiseman et al. (2017) involves generating textual descriptions of basketball games from score statistics tables. For ToTTo (Parikh et al., 2020) and DART (Radev et al., 2020), the input is a table with relevant highlighted cells, and the goal is to generate text describing the highlighted cells.

KG-Text alignment
T-REx (Elsahar et al., 2018) is the most widely used Text-KG aligned corpus. It consists of 11M Wikidata KG triples aligned to 6M Wikipedia sentences. It uses complex systems such as coreference resolution and predicate linkers to generate a clean corpus for alignment. In comparison, we use entity alias-based heuristics coupled with source text selection restrictions to generate a corpus of 16M triples aligned with 8M sentences. While some of our sentences are noisier than T-REx in some aspects, we also improve upon some of its alignment errors, which we discuss later. Our goal was to obtain broad coverage of relation types so as to train a system that can convert an entire KG to text. Moreover, since our inference relies on sets of triples sharing the same subject entity, we generated the training corpus with the same property.
A concurrent work to ours (Chen et al., 2020) created an aligned corpus for pre-training a data-to-text model. Their alignment procedure is significantly different from ours: they use Wikipedia hyperlinks to determine KG entity occurrences in full Wikipedia text. Earlier work had also created a corpus using hyperlinks and coreference resolution. Reliance on hyperlinks misses dates, quantities and KG entities that do not have a Wikipedia page, and hence relations such as date of birth, occupation, publication year and distance from Earth. Our alias-based matching, on the other hand, covers this wide variety of relations.
There is a vast literature on the inverse task of automatic KG construction from text (Etzioni et al., 2008; Angeli et al., 2015; Clancy et al., 2019); however, these works generally describe the methodology and do not release the corresponding dataset.

Figure 2: KG verbalization process. First, a Wikidata KG triples-Wikipedia Text aligned corpus is created. Next, T5 is finetuned sequentially, first on this corpus, followed by a small number of steps on the WebNLG corpus. Triples from the entire KG are grouped together for inference using the relation alignment statistics from the aligned training corpus. The grouped triples are then converted into natural text using the trained model. The generated sentences are further filtered for quality using a BERT model finetuned on human-assigned semantic scores.

Incorporating KGs
Most prior works on incorporating KGs with text learn KG entity representations and add them to the mention spans linked to the entity (Févry et al., 2020). Some works employ different techniques or incorporate additional modules. Verga et al. (2020) extend Févry et al. (2020) by adding a triple memory in addition to an entity memory; object value vectors are retrieved from this triple memory using a joint encoding of the relevant (subject, relation) pair and incorporated into the final representation. Other work also retrieves triples for a given sentence but does not add them to entity spans: a local KG is maintained for the entities mentioned in the text and their relations, and the model either generates a non-entity, an entity from the local KG, or a new entity conditioned on the local KG, adding it to the graph for future generation within that sentence. Das et al. (2017) use universal schema (Riedel et al., 2013), which embeds text and KGs in a shared space, for their integration. K M et al. (2018) perform pre-training to learn a single representation for all the triples mentioned in a sentence. Their pre-training stage generates an entity and a relation embedding for a given sentence as attention over the full set of entities in the KG; these are further updated during finetuning along with the task-specific classification objective. In contrast, we do not learn any entity representations or perform entity linking. We convert the KG into text and use it to augment the pre-training data.

KG Verbalization
In this section, we discuss the full system for converting the KG into natural text. We compare this system to one trained on just WebNLG data using the same model architecture to illustrate the difference in performance quantitatively. Verbalizing a full KG has several challenges compared to a smaller, domain-specific dataset such as WebNLG:
• Coverage: Many more entities (∼6M vs ∼600) and relations (∼1500 vs ∼20).
• Ungrouped triples: WebNLG provides sets of triples for which the goal is to generate text, and these triples are known to produce coherent text together. In a full KG, however, triples are not already grouped.

Text-KG alignment
We first create training data for the task by aligning Wikidata triples to Wikipedia text (see Figure 3). For each subject entity, we restrict ourselves to the root section of its Wikipedia page, because this section generally describes the relations of the subject entity with other entities. For each sentence in this section, we match all triples that have this entity as the subject. A triple is said to match if any of the aliases of the object entity matches. We do not match relations, as there are too many ways to express them. Restricting to the subject entity's page and root section generally ensures that the relation is expressed in the sentence if the sentence mentions the object entity. Each triple can align to multiple sentences and each sentence can have multiple triples aligned to it. If any alias of the subject entity matches the given sentence, the sentence is selected as is; otherwise the first animate third-person personal or possessive pronoun is replaced by the subject entity's canonical name. This pronoun replacement heuristic also works well because of the subject entity page and root section restriction. All triples aligned to a given sentence are combined together as a single example.
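As a rough illustration, the alias-based matching and pronoun-replacement heuristic described above can be sketched as follows. The function name, the pronoun list, and the data layout are our own simplifications for exposition, not the paper's actual implementation:

```python
import re

def align_triples_to_sentence(sentence, subject_aliases, triples):
    """Return the (possibly pronoun-replaced) sentence and the triples
    whose object alias appears in it.

    `triples` is a list of (subject, relation, object_aliases) tuples.
    Relations are deliberately not matched, per the described heuristic.
    """
    aligned = [t for t in triples
               if any(alias in sentence for alias in t[2])]
    if not aligned:
        return None, []
    # If no subject alias occurs in the sentence, replace the first
    # animate third-person pronoun with the subject's canonical name
    # (a simplified stand-in for the paper's heuristic).
    if not any(alias in sentence for alias in subject_aliases):
        sentence = re.sub(r"\b(He|She|His|Her)\b",
                          subject_aliases[0], sentence, count=1)
    return sentence, aligned
```

For example, aligning the triple (Marie Curie, place of birth, Warsaw) to the sentence "She was born in Warsaw." from the subject's root section would yield "Marie Curie was born in Warsaw." with one aligned triple.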
We extract several types of triples, such as entities, dates and numerical quantities, each of which has a slightly different matching technique. While the types of entities are important in the alignment process, the verbalization model is agnostic to the type and treats all triples the same. Alignment statistics are shown in Table 1 and some alignment examples in Table 2. There are a total of ∼45M triples, ∼35% of which were aligned to a sentence. In some cases, multiple triples aligned to the same sentence, resulting in ∼8M examples covering ∼42% of the relations.
This aligned corpus is somewhat noisy because each sentence is assumed to express only one subject entity and its relations. This error does not exist in T-REx due to its use of NLP pipelines. We still chose to make this restriction so as to have the same property as inference, since we group triples by subject entity during inference; it also kept the alignment process relatively simple. However, we believe these errors are minimal because of the restriction to the entity's page and root section. Due to this restriction, there was also no need to match relations to the text, and it avoided issues such as incorrect entity linking and incorrect entailment. The T-REx paper gives an example where "Virginia" in "Carolyn Virginia Wood (born December 18, 1945) is an American" is linked to the state of Virginia, and another where the sentence "Ernst Gustav Kuhnert was born in Tallinn, Estonia" gets aligned to (Tallinn, Capital of, Estonia). Our corpus does not have such errors due to the root section restriction. In the first example, an error would occur only if the entity had a relation whose object entity is the state of Virginia, and in the second case, we would not extract this triple since all triples in this sentence would have "Ernst Gustav Kuhnert" as the subject entity.

Training
We finetune the pretrained T5 (Raffel et al., 2020) large model for converting triples to text. Triples are concatenated as "subject relation_1 object_1, ..., relation_n object_n" and fed as input to T5. We perform sequential finetuning: first on the aligned corpus described in the previous section for 5000 steps, which increases the coverage of entities and relations but also results in the generation of Wikipedia-like sentences and hallucination when an expected input triple is missing; then for a small number of steps on the WebNLG corpus. We use a learning rate of 0.001, a batch size of 1048576 tokens and a maximum decoding length of 256.
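The input linearization described above might look like the following sketch. The exact delimiters and casing are assumptions on our part; the paper only gives the schematic format:

```python
def linearize(subject, relation_objects):
    """Concatenate all triples sharing a subject into a single T5 input
    string of the form 'subject relation_1 object_1, relation_2 object_2'.

    `relation_objects` is a list of (relation, object) string pairs.
    """
    parts = [f"{rel} {obj}" for rel, obj in relation_objects]
    return f"{subject} " + ", ".join(parts)
```

The model then learns to map such a string to the aligned Wikipedia sentence as the target sequence.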

Inference
Training data has multiple triples aligned per sentence, and using single triples as input during inference can lead to hallucination. We therefore develop a grouping strategy in which the triples for a given entity are grouped based on co-occurrence counts, i.e. the frequency with which their relations aligned to the same sentence in the training data. We use greedy aggregation up to a maximum depth of 5, with a co-occurrence cutoff of 5 to avoid noisy alignments. For a given subject entity s, we first select a triple (s, r_i, o_i). Next, from the set of remaining triples of the form (s, r, o), we select (s, r_j, o_j) such that r_j has the highest co-occurrence count with r_i of all r. We then select (s, r_k, o_k) such that r_k has the highest co-occurrence count with r_j, and continue in this way up to a maximum of 5 triples. Triples that do not get aggregated are fed one at a time. The aggregation produces 18M groups from the 45M triples, i.e. the final corpus has 18M generated sentences. We perform top-5 sampling with a temperature of 0.5.
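The greedy aggregation described above can be sketched as follows. This is an illustrative re-implementation, not the authors' code; in particular, the direction of the co-occurrence lookup and the tie-breaking behavior of `max` are our assumptions:

```python
def group_triples(triples, cooc, max_len=5, min_count=5):
    """Greedily chain one subject's triples by relation co-occurrence.

    `triples`: list of (relation, object) pairs for a single subject.
    `cooc[(r1, r2)]`: how often relations r1 and r2 aligned to the
    same sentence in the training corpus.
    """
    remaining = list(triples)
    groups = []
    while remaining:
        group = [remaining.pop(0)]           # seed a new group
        while len(group) < max_len and remaining:
            prev_rel = group[-1][0]
            # Pick the remaining triple whose relation co-occurs most
            # often with the previously added relation.
            best = max(remaining,
                       key=lambda t: cooc.get((prev_rel, t[0]), 0))
            if cooc.get((prev_rel, best[0]), 0) < min_count:
                break                        # too noisy: start a new group
            group.append(best)
            remaining.remove(best)
        groups.append(group)
    return groups
```

Triples that never clear the cutoff end up in singleton groups, matching the "fed one at a time" fallback.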

Quality Filtering
We perform semantic quality based filtering on the generated corpus. A semantic quality score is assigned to each generated sentence w.r.t. the input triples, denoting whether the generated text captures the full meaning of the triples and does not hallucinate extra information. The bottom 1% of the generated sentences are filtered out based on this score. The score is generated using a BERT base uncased model finetuned for 1000 steps on the WebNLG 2017 human assessment data. System predictions from the WebNLG 2017 challenge were rated on a scale of 1-3 for semantics and fluency. We use the semantics score and scale it to 0-1. We also add gold references with a score of 1. This results in 2706 examples, 90% of which are used for finetuning and the remainder for evaluation. High correlations are obtained between the predicted scores and the human scores on the evaluation split (see Table 3).

Table 5: Human evaluation of the generated corpus on a scale of 1-5. W refers to finetuning only on WebNLG and SEQ refers to finetuning on the aligned corpus followed by WebNLG. U refers to inference on one triple at a time and G refers to inference on a set of triples, grouped by the described strategy.
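The score scaling and percentile cutoff just described might be sketched as follows. This is illustrative only: the actual scorer is the finetuned BERT model, and the function names here are our own:

```python
def scale_score(raw, lo=1.0, hi=3.0):
    """Map a 1-3 WebNLG semantic rating onto [0, 1]."""
    return (raw - lo) / (hi - lo)

def filter_bottom_percent(sentences, scores, pct=1.0):
    """Drop the lowest-scoring `pct` percent of generated sentences,
    keeping the rest. Scores come from the semantic-quality model."""
    n_drop = int(len(sentences) * pct / 100)
    ranked = sorted(zip(scores, sentences))   # ascending by score
    return [s for _, s in ranked[n_drop:]]
```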

Human Evaluation
Generation quality of the KELM Corpus is evaluated using human ratings on a sample of 200 random instances of grouped triples. Automatic metrics such as BLEU (Papineni et al., 2002) or BERTScore (Zhang et al., 2019) cannot be used due to the lack of gold references. The generated text is rated on two aspects, fluency and semantics, on a scale of 1-5, where 1 means not at all fluent/does not capture the meaning at all and 5 means completely fluent/fully captures the meaning with no hallucination. We compare our system to a T5 model trained on just the WebNLG 2017 training data. For Baseline, we use the same 200 instances for evaluation but without grouping them, as grouping is part of our system; this results in 524 ungrouped triples that are rated on the same scale for both aspects. For Baseline++, we perform grouping during inference. Scores are shown in Table 5 and some example generations from the two systems are shown in Table 4. The final system has higher averages and less variation in scores. It also more often paraphrases the canonical relation names of the KG into more natural expressions.

Applications
In this section, we showcase an application of the generated synthetic KELM Corpus. Language models are trained on large natural text corpora such as Wikipedia or Common Crawl. KGs are a rich source of factual information that can serve as additional, succinct knowledge. However, the differing structures of text and KGs make it hard to integrate the two. We propose verbalization of KGs as a simple method to incorporate KGs into pre-training. Specifically, we augment an existing model with the KELM Corpus and show gains on the LAMA knowledge probe and open domain QA. We also perform experiments where we integrate raw triples instead of the verbalized KG to confirm the benefits of verbalization.

Integration in REALM
REALM (Guu et al., 2020) introduced a way to build more interpretable language models using a retrieve-and-read paradigm. It uses two corpora for pre-training: a retrieval corpus and a pre-training corpus. During pre-training, a sentence is selected at random from the pre-training corpus, a random word or salient span (dates and entities) is masked in this sentence, and the masked word is then predicted using a joint representation of the masked sentence and each of the documents in the retrieval corpus. In the finetuning stage, the model is given a query/question as input in place of a masked sentence, retrieves a small set of documents from the retrieval corpus based on vector similarity, and selects a span of text from the retrieved documents as the answer. A similar system, RAG (Lewis et al., 2020), uses the same paradigm but generates the answer from the representation of the selected documents instead of selecting a span verbatim, even during finetuning. We merge sentences in the KELM Corpus by subject entity to create 5,722,974 documents and then replace or augment the retrieval corpus in REALM with these synthetic documents. The KELM Corpus has ∼286M words, compared to ∼2B words in English Wikipedia.
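Merging the generated sentences into one retrieval document per subject entity could look like the following sketch. The pair-based input format and joining with spaces are assumptions; the paper does not specify the exact document format:

```python
from collections import defaultdict

def build_documents(entity_sentence_pairs):
    """Merge generated sentences by subject entity into one synthetic
    retrieval document per entity, for augmenting REALM's corpus.

    `entity_sentence_pairs`: iterable of (entity_id, sentence) pairs.
    """
    docs = defaultdict(list)
    for entity, sentence in entity_sentence_pairs:
        docs[entity].append(sentence)
    # One document per entity: its sentences concatenated in order.
    return {entity: " ".join(sents) for entity, sents in docs.items()}
```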

Datasets
NaturalQuestions (NQ) (Kwiatkowski et al., 2019) has queries that were issued to Google and their answers.
WebQuestions (WQ) (Berant et al., 2013) has question-answer pairs collected using the Google Suggest API. We keep the same settings as REALM for both NQ and WQ, i.e. we work in the open domain setting for both datasets, where no passage is provided as context for each question. Finetuning is performed on the respective training splits.
REALM was developed with the goal of providing an interpretable mechanism for knowledge intensive tasks. However, the model was not evaluated on LAMA. We first evaluate REALM on LAMA using the original retrieval corpus and then using the KELM corpus. No finetuning is performed and the masked word predictions from the pretrained models are used as answers.

Results
We evaluate REALM on WQ, NQ and LAMA under three settings, modifying the retrieval corpus: i) Original: Wikipedia text; ii) Replaced: only the KELM Corpus; and iii) Augmented: Wikipedia text + KELM Corpus. The Replaced and Augmented models are evaluated using both the raw triples and the generated sentences. The model is pre-trained for 200k steps with the CC-News pre-training corpus in all cases with default hyperparameters.
For the two open domain QA datasets, we finetuned the pretrained REALM on the respective training splits. While we were able to reproduce the accuracy on WQ, the accuracy we obtained on NQ was around 1.5% absolute lower than the reported accuracy (see rows 1 & 2 in Table 7). The LAMA probe was not among the evaluation datasets in the REALM paper, so we first evaluated the pretrained REALM on LAMA, reporting results on the different sub-corpora in Table 6 (row Wikipedia under REALM). Even the original REALM model shows significant improvement over prior models. The architectural paradigm of REALM, in which the model can access the corpus documents even during inference, not only makes it interpretable but also stronger on knowledge intensive tasks. It obtains 67.36% accuracy on Google-RE, 68.18% on T-REx and 27.96% on SQuAD. In comparison, the reported accuracy for BERT is 10.50% on Google-RE, 32.30% on T-REx and 17.40% on SQuAD. BERT performs better on 1-1 T-REx relations, with 74.50% accuracy. In the Replaced setting, where the retrieval corpus is only the KELM Corpus (Table 7), the performance on WQ is on par with the original model. However, the performance on NQ is much lower. This can be attributed to the nature of the datasets: WQ is a KG QA dataset, whereas NQ consists of queries issued to Google. On LAMA (rows 2 & 3 under REALM in Table 6), the performance is lower than the original REALM but much higher than BERT. We compare the two formats, raw triples and generated sentences: when using just the KELM Corpus, the format does not matter much, and both have similar performance. However, a system trained on raw triples may not generalize to tasks where sentence structure is important.
Finally, we evaluate the Augmented model, which uses both the Wikipedia text and the synthetic KELM Corpus for retrieval. Results are shown in the last two rows of Tables 6 and 7. We observe improvements on all the datasets with this Augmented model. There is an absolute gain of 2.63% and 3.10% on NQ and WQ respectively over the original Wikipedia-text-only model. Similarly, there are absolute gains of 12.94%, 0.95%, 3.61% and 0.47% on Google-RE, T-REx, SQuAD and ConceptNet in LAMA respectively. We again compare the two settings where we augment with the raw triples or the generated sentences. The improvement is higher when the generated sentences are added instead of the raw triples, confirming the effectiveness of verbalizing the KG into natural language sentences.
We inspected some of the errors made by the Augmented model on LAMA; they can be broadly classified into four categories:
1. Ambiguous query: e.g. in "X was born in ___", the answer could be the year of birth or the place of birth, but only one of them is acceptable depending on the sub-corpus the particular query appears in.
2. Incomplete answer set: e.g. in "Konstantin Mereschkowski had a career as ___", the gold target is biologist and the system predicted botanist, but both should be correct.
3. Answer granularity: there are cases where the system predicts a more specific answer and is actually correct. e.g. in "On the CPI scale, Kenya ranks ___", the gold answer is low but the system predicted 101, which is in fact correct.
4. Actual errors: these are genuinely incorrect answers by the system.

Discussion And Future Work
In this paper, we converted an entire Knowledge Graph into natural text, tackling various challenges that do not arise when verbalizing smaller datasets. We further incorporated this generated corpus into a language model, specifically as a retrieval corpus in the retrieve-and-read architecture of REALM.
We evaluated the augmented model on open domain QA and a knowledge probe, showing improvements on both.
Several recent works have explored integrating KGs into natural text datasets. The ability to seamlessly integrate KGs into natural text has enormous benefits stemming from the complementary nature of the two data sources. Each is incomplete on its own: KGs often miss holistic elements, while natural text often expresses facts sparsely and implicitly. Prior works have mostly focused on learning simultaneous representations of both data sources during pre-training. This requires building larger models and, in some cases, maintaining the structure of the KG. Our solution is simpler, since converting a KG into natural text alleviates the difference in structure, and allows KG information to be incorporated into any model without architectural changes or the need to scale up parameters.
The KELM corpus we release provides many advantages both as a standalone dataset as well as a complementary dataset to existing systems. Perhaps the most obvious advantage is that the KELM corpus is derived directly from a KG, and is therefore more factually accurate than almost any other existing large text corpus. Most text corpora today are crawled from the Web and contain a lot of inaccurate information, including stale or contradictory facts. KGs such as Wikidata, on the other hand, are constantly updated and contain almost exclusively factually accurate and up-to-date information.
Another advantage the KELM corpus has over other text corpora is the absence of offensive content. One of the biggest challenges in the NLP community today is the development of systems that do not generate toxic text or learn spurious correlations. Since most models are trained on data crawled from the Web, it is highly likely they are exposed to objectionable text. However, training a model on the KELM Corpus avoids this risk since it is derived from a purely fact-based data source. In its current state, the KELM Corpus may be insufficient to completely replace larger Web-based corpora. But if larger KGs are verbalized, or several KGs from various sources are verbalized together, the resulting datasets could potentially contain all of the useful knowledge of a Web-based corpus without any of the detrimental offensive content.
The KELM corpus could also provide an advantage in mitigating bias in models, which is another major challenge in the NLP community. Wikipedia has documented ideological, gender, and racial biases in its text. While the KELM corpus may still contain some of these biases, certain types of biases may be reduced. For example, text describing a particular war and how it started can easily contain implicit biases. However, a KG would only contain factual information about the war, such as the dates fought, participant countries, and location. For many applications, this factual information is sufficient, and incorporating opinions in text only leads to bias. Coverage biases may still exist in KGs since they are curated by editors, and lesser known topics could be missing. A study on the comparative bias of such a system is a future direction to pursue.
Another future direction related to this work could be expanding the generation to multi-hop relations. The generation in this paper is restricted to a given entity and its relations to other entities. This could be extended to multi-hop relations in order to generate more complex sentences. Since the current generation method covers all facts in the KG even if they are not expressed in the same sentence, it remains to be explored if the multi-hop generation would be useful. For example, suppose we have two sentences -"X is a child of Y" and "Y is a child of Z". If a system can infer that this means X is a grandchild of Z, then multi-hop might not be beneficial. However, if it is not able to infer this fact, a system that does multi-hop generation from a KG could be useful.
Extending the work in this paper beyond the English language would be another exciting future direction. Recent work has shown promising results on generating multilingual text from English triples. Our proposed approach could thus be applied to generate a multilingual corpus of facts in various languages using the English Wikidata.
Finally, it remains to be explored how KELM performs on linguistic tasks such as part-of-speech tagging and dependency parsing. While the generated sentences are fluent and grammatical, they may not cover many complex sentence structures, making the corpus useful for augmentation but not fully replacing pre-training data, yet.