Enhancing Neural Data-To-Text Generation Models with External Background Knowledge

Recent neural models for data-to-text generation rely on massive parallel pairs of data and text to learn the writing knowledge. They often assume that writing knowledge can be acquired from the training data alone. However, when people are writing, they not only rely on the data but also consider related knowledge. In this paper, we enhance neural data-to-text models with external knowledge in a simple but effective way to improve the fidelity of generated text. Besides relying on parallel data and text as in previous work, our model attends to relevant external knowledge, encoded as a temporary memory, and combines this knowledge with the context representation of data before generating words. This allows the model to infer relevant facts which are not explicitly stated in the data table from an external knowledge source. Experimental results on twenty-one Wikipedia infobox-to-text datasets show our model, KBAtt, consistently improves a state-of-the-art model on most of the datasets. In addition, to quantify when and why external knowledge is effective, we design a metric, KBGain, which shows a strong correlation with the observed performance boost. This result demonstrates the relevance of external knowledge and sparseness of original data are the main factors affecting system performance.


Introduction
Automatic text generation from structured data (data-to-text) is a classic task in natural language generation which aims to automatically generate fluent, truthful and informative texts based on structured data (Kukich, 1983; Holmes-Higgin, 1994; Reiter and Dale, 1997). Data-to-text is often formulated as two subproblems: content selection, which decides what contents should be included in the text, and surface realization, which determines how to generate the text based on the selected contents. Traditionally, these two subproblems have been tackled separately. In recent years, neural generation models, especially the encoder-decoder model, solve these two subproblems jointly and have achieved remarkable successes on several benchmarks (Mei et al., 2016; Lebret et al., 2016; Wiseman et al., 2017; Dušek et al., 2018; Nie et al., 2018). Such end-to-end data-to-text models rely on massive parallel pairs of data and text to learn the writing knowledge. They often assume that all writing knowledge can be learned from the training data. However, when people write, they rely not only on the data contents themselves but also on related knowledge, which is neglected by previous methods. For example, as shown in Fig. 1, an infobox about a person called Nacer Hammami is paired with its corresponding biography description from Wikipedia. However, the information in the infobox is not enough to cover all the facts mentioned in the description. To generate this description from the infobox, we need to expand the information with external background knowledge about its related entities. For example, in the description:
1) "is an Algerian football player" indicates the nationality of Nacer Hammami, which is not explicitly stated in the infobox. However, the place of birth of Nacer Hammami, Guelma, is given, so the nationality can be inferred from the knowledge that Guelma is a place in Algeria.
2) "playing for MC El Eulma in the Algerian Ligue Professionnelle 1" describes the fact that MC El Eulma is a club in the Algerian Ligue Professionnelle 1, which is also not explicitly stated in the infobox but can be obtained from an external knowledge base.
* Contribution during internship at Microsoft Research.
One may argue that neural models can learn such knowledge when enough parallel co-occurrence pairs such as (Guelma, Algerian) and (MC El Eulma, Algerian Ligue Professionnelle 1) are available. However, even in such cases, neural models still tend to make mistakes on sparse co-occurrence pairs, as we will show in the experiments section.
In this paper, we enhance neural-network-based data-to-text generation models with external knowledge in a simple but effective way. Besides learning the association between data and text from parallel data-text pairs as in previous work, our model attends to relevant external knowledge, encoded as a temporary memory, and combines this knowledge with the context representation of data before generating words. Specifically, both the infobox and the background knowledge facts are encoded, and a dual-attention mechanism is proposed to guide the decoder to generate text.
To verify the effectiveness of our proposed model, the Knowledge Base enhanced Attention-based sequence-to-sequence network (KBAtt), we conduct experiments on multiple Wikipedia infobox-to-text datasets, including WikiBio (Lebret et al., 2016) and 20 new datasets 1. Our experimental results show that KBAtt consistently improves a state-of-the-art neural data-to-text model and achieves higher performance on most of the datasets. To quantify when and why external knowledge is effective, we design a metric which shows a strong correlation with the observed performance boost. This result demonstrates that the relevance of external knowledge and the sparseness of the original data are the main factors affecting system performance.
1 Available at https://github.com/hitercs/WikiInfo2Text
The contributions of our work can be summarized as follows:
• We demonstrate that an external knowledge base can be used to enhance the performance of neural data-to-text models.
• We propose a simple yet effective model, KBAtt, to integrate external knowledge using a dual-attention mechanism.
• We design a metric, KBGain, to quantify when and why external knowledge is effective.
• We contribute twenty infobox-to-text datasets from a variety of domains.

The Proposed Model
Our model takes a data table $D$ (e.g., a Wikipedia infobox) and a relevant external knowledge base (KB) containing a set of facts $F$ as input, and generates a natural language text $y = y_1, \ldots, y_T$ consisting of $T$ words. To augment the infobox with external knowledge, we preserve the Wikipedia internal hyperlink information in the field values of the infobox, and follow these hyperlinks to get their corresponding entities from Wikidata, from which we retrieve only one-hop facts. The backbone of our model is an attention-based sequence-to-sequence model equipped with a copy mechanism (See et al., 2017). As shown in Fig. 2, the model consists of four main components: a table encoder, a KB encoder, the dual attention mechanism and a decoder. We describe each component in the following sections.

Table Encoder
In Fig. 2, the input data table $D$ consists of several field name and field value pairs. We follow Sha et al. (2017) [...], where $[\cdot\,;\cdot]$ denotes the concatenation of vectors. Each $x_i$ is then encoded into a hidden vector $h_i$ using a bi-directional GRU.

Knowledge Base Encoder
As shown in Fig. 2, we extract the entities mentioned in the field values of the infobox and link them to Wikidata. We can then retrieve from Wikidata the relevant facts whose subject is the linked entity. These facts contain important background knowledge related to the infobox, which is helpful for generation. The KB fact set is denoted by $F = \{(n_j, s_j, r_j, o_j)\}_{j=1}^{|F|}$, where $s_j$, $r_j$ and $o_j$ are the subject, relation and object of the fact respectively, and the field name $n_j$ indicates that the fact is linked from the field value of $n_j$. In order to integrate such KB facts into the neural model, we apply a Multi-Layer Perceptron (MLP) with trainable weights $W_f$ and bias $b_f$ to encode each fact into its representation $f_j$, taking as input $e_{n_j}$, $e_{s_j}$, $e_{r_j}$ and $e_{o_j}$, the embeddings of the field name $n_j$, subject $s_j$, relation $r_j$ and object $o_j$ respectively. To accommodate generation steps where no information from the external knowledge is needed, such as generating the name field, which is already stated in the table, we apply a simple strategy: padding the knowledge base with a none fact.
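The fact encoder described above can be illustrated with a minimal sketch. The extracted text does not preserve the encoder equation, so the single tanh layer over the concatenated embeddings, the toy dimensions, the random initial values and the all-zero embedding used for the padded none fact are all assumptions made for illustration, not the authors' confirmed formulation.

```python
import math
import random


def encode_fact(e_n, e_s, e_r, e_o, W_f, b_f):
    """Encode one KB fact (n_j, s_j, r_j, o_j) into a vector f_j.

    Assumed form: f_j = tanh(W_f [e_n; e_s; e_r; e_o] + b_f).
    """
    x = e_n + e_s + e_r + e_o  # list concatenation of the four embeddings
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(W_f, b_f)]


def random_vector(size, rng):
    """Toy stand-in for a trainable embedding or weight row."""
    return [rng.uniform(-0.08, 0.08) for _ in range(size)]


rng = random.Random(0)
emb_dim, fact_dim = 4, 3  # toy sizes, far smaller than a real model
W_f = [random_vector(4 * emb_dim, rng) for _ in range(fact_dim)]
b_f = [0.0] * fact_dim

# Hypothetical embeddings for a fact such as
# (place of birth, Guelma, country, Algeria).
field, subj, rel, obj = (random_vector(emb_dim, rng) for _ in range(4))
f_j = encode_fact(field, subj, rel, obj, W_f, b_f)

# Padded "none" fact, represented here by all-zero embeddings (an assumption).
zero = [0.0] * emb_dim
f_none = encode_fact(zero, zero, zero, zero, W_f, b_f)
```

Every fact, including the none fact, ends up as a vector of the same size, which is what lets a single attention layer score them uniformly.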

Dual Attention Mechanism
After encoding the table and the background knowledge base facts, we apply an RNN-based decoder to generate words conditioned on both the table information and the background knowledge facts. In general, given a decoder hidden state $d_t$ at time step $t$, we apply a dual attention mechanism, consisting of table attention and KB attention, to determine which parts of the input the decoder should attend to. Next, we briefly introduce table attention and KB attention.

Table Attention
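Given the decoder hidden state $d_t$, the table attention computes a weight $\alpha_{t,i}$ for each table hidden vector $h_i$, where $W_a$, $U_a$ and $v_a$ are trainable parameters and $\alpha_{t,i}$ is the table attention weight. The weighted hidden vectors give the table context representation $c_t^{table}$.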
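Since the attention equations did not survive extraction, the following sketch shows one common way to compute such weights with parameters named $W_a$, $U_a$ and $v_a$: additive attention, with score $v_a^\top \tanh(W_a d_t + U_a h_i)$ followed by a softmax. That this is the exact scoring function used by the model is an assumption; the toy matrices below are illustrative only.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(d_t, H, W_a, U_a, v_a):
    """Return attention weights over H and the resulting context vector.

    Assumed score: e_{t,i} = v_a^T tanh(W_a d_t + U_a h_i).
    """
    Wd = matvec(W_a, d_t)
    scores = []
    for h in H:
        Uh = matvec(U_a, h)
        scores.append(sum(v * math.tanh(a + b)
                          for v, a, b in zip(v_a, Wd, Uh)))
    alpha = softmax(scores)
    dim = len(H[0])
    context = [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(dim)]
    return alpha, context


# Toy example: three encoder states of size 2 and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
d_t = [1.0, 0.0]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, c_table = additive_attention(d_t, H, identity, identity, [1.0, 1.0])
```

The same function could be applied to the fact representations $f_j$ (with its own parameters) to obtain the KB context representation.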

KB Attention
Besides utilizing table information, the generated words may contain facts which are not directly mentioned in the table but can be inferred from the background knowledge $F$. In order to integrate such knowledge into the decoder, we apply KB attention over $\{f_j\}_{j=1}^{|F|}$. Similar to the table attention, we obtain the KB context representation $c_t^{kb}$.

Decoder
The decoder is a single-layer GRU equipped with a copy mechanism. In generation mode, given a decoder hidden state $d_t$, the decoder attends to both the table and the knowledge base using the mechanism described above, and obtains the table context representation $c_t^{table}$ and the KB context representation $c_t^{kb}$, which are combined into the context representation $c_t$ at time step $t$. Then, given the table $D$ and the knowledge base fact set $F$, the probability $P_{vocab}(y_t)$ of generating word $y_t$ from a fixed vocabulary at time step $t$ is computed from $f(d_t, y_{t-1}, c_t)$, where $f(\cdot)$ is a non-linear function applied to the decoder hidden state $d_t$, the previous word embedding $y_{t-1}$ and the current context vector $c_t$.
To tackle the rare and unknown words problem, we adopt the copy mechanism (See et al., 2017). Specifically, a gate $p_{gen} \in [0, 1]$ is introduced to switch between copy mode and generation mode. The generation probability $p_{gen}$ is defined as:
$p_{gen} = \sigma(w_c^\top c_t + w_d^\top d_t + w_y^\top y_{t-1} + b)$
where $\sigma$ stands for the sigmoid function, while the vectors $w_c$, $w_d$, $w_y$ and the scalar $b$ are trainable parameters. The joint probability of generating $y_t$ at time step $t$ is formulated as follows:
$P(y_t \mid y_{<t}, D, F) = p_{gen} P_{vocab}(y_t) + (1 - p_{gen}) \sum_{i} \alpha_{t,i}$ (8)
where the sum runs over the table positions $i$ whose token equals $y_t$, and $\alpha_{t,i}$ is the table attention weight defined in Equation 3.
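As a small illustration of how the gate blends the two modes, the sketch below follows the pointer-generator formulation of See et al. (2017): the vocabulary distribution is scaled by $p_{gen}$ and the table attention mass is added, scaled by $1 - p_{gen}$, to the tokens at the attended positions. The function names, the toy vocabulary and the attention values are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gen_gate(c_t, d_t, y_prev, w_c, w_d, w_y, b):
    """Sigmoid gate p_gen computed from context, decoder state and last word."""
    z = dot(w_c, c_t) + dot(w_d, d_t) + dot(w_y, y_prev) + b
    return 1.0 / (1.0 + math.exp(-z))


def final_distribution(p_gen, p_vocab, alpha, table_tokens):
    """Mix generation and copy probabilities into one distribution.

    p_vocab: dict word -> probability under the fixed vocabulary.
    alpha: table attention weights, one per table position.
    table_tokens: the token at each table position (may be out of vocabulary).
    """
    dist = {word: p_gen * p for word, p in p_vocab.items()}
    for weight, token in zip(alpha, table_tokens):
        dist[token] = dist.get(token, 0.0) + (1.0 - p_gen) * weight
    return dist


# Toy values: "Guelma" is outside the vocabulary but can still be copied.
p_vocab = {"is": 0.5, "an": 0.3, "<unk>": 0.2}
alpha = [0.7, 0.2, 0.1]
table_tokens = ["Guelma", "Nacer", "Hammami"]
dist = final_distribution(0.6, p_vocab, alpha, table_tokens)
neutral_gate = gen_gate([0.0], [0.0], [0.0], [1.0], [1.0], [1.0], 0.0)
```

Because both input distributions sum to one, the mixture does too, so rare table tokens receive probability mass without enlarging the vocabulary.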

Training
Given a training dataset $\{(D^k, F^k, y^k)\}_{k=1}^{S}$ consisting of $S$ samples, our goal is to maximize the probability of the target description $y = y_1, \ldots, y_T$ given the input table $D$ and the background knowledge facts $F$. The objective is therefore to minimize the negative log-likelihood:
$\mathcal{L} = -\sum_{k=1}^{S} \log P(y^k \mid D^k, F^k)$
The objective function is fully differentiable, so the entire model can be trained end-to-end through backpropagation. Adam (Kingma and Ba, 2014) is used to optimize our model.
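To make the objective concrete, here is a minimal sketch that computes the negative log-likelihood from the probabilities a model assigns to the gold tokens. It assumes the usual left-to-right factorization of the sequence probability into per-step probabilities; the numbers are made up.

```python
import math


def sequence_nll(step_probs):
    """Negative log-likelihood of one description.

    step_probs[t] is the probability assigned to the gold word y_t
    given the previous words, the table and the KB facts.
    """
    return -sum(math.log(p) for p in step_probs)


def corpus_nll(batch):
    """Sum of per-sequence negative log-likelihoods over all samples."""
    return sum(sequence_nll(step_probs) for step_probs in batch)


# Two toy training samples with made-up gold-token probabilities.
loss = corpus_nll([[0.5, 0.25], [1.0]])
```

Here the loss equals $\log 8$, since the only uncertain sequence has probability $0.5 \times 0.25$.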


When and Why External Knowledge is Beneficial
In this section, we introduce KBGain to quantify when and why external knowledge (KB) is effective, and then show in Section 4.4 how KBGain correlates with the performance boost of KBAtt over Seq2Seq+Copy on the 21 datasets. Intuitively, incorporating an external KB should improve data-to-text generation performance. However, pinpointing the effect of the additional knowledge is not trivial, since (a) not all external knowledge is relevant and (b) neural models may memorize certain inference patterns when the parallel data is big enough. We assume that (1) matching tokens between the external KB and the references indicate the relevance of those tokens in the external KB; and (2) a high co-occurrence frequency of matching tokens between the infobox and the references indicates good potential for neural models to learn generation patterns from the original data alone, i.e., the original data is not sparse, which makes an external KB less effective. To characterize these factors, we introduce KBGain. KBGain measures the portion of tokens in the references that are learnable from their co-occurring external KB entries, filtering out those tokens that are also learnable from the infobox. We say a token is learnable from one source (infobox or KB) when the co-occurrence frequency between them is higher than a threshold γ, the minimum co-occurrence count above which learning will be effective. As illustrated in Fig. 3, when the Infobox-Ref co-occurrence does not exceed γ but the KB-Ref co-occurrence does (category C), this is where the KB will be effective. KBGain is defined as the average, over the test set, of the ratio between the number of tokens falling into category C and the length of the corresponding reference.
Specifically, among all tokens in the reference except stop words and punctuation, we select those that string-match 4 the object tokens of the KB but not the infobox. For example, the token Algerian in the reference is matched with the object of the last KB tuple. The KB-Ref token pair is simply acquired by string matching, while the Infobox-Ref token pair is obtained by first finding the corresponding field based on the matched tuple, and then selecting the token in its field value with the highest pointwise mutual information (PMI).
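The KBGain computation can be sketched as follows. This is a simplified illustration rather than the authors' implementation: co-occurrence counts are assumed to be precomputed on the training set and keyed by reference token, the PMI-based choice of the infobox token is assumed to have happened already, the boundary case (a count exactly equal to γ) is treated as not learnable, and the reference length is taken as the full token count. All names and counts are hypothetical.

```python
def kb_gain(instances, kb_cooc, infobox_cooc, gamma=25, stop_words=frozenset()):
    """Average share of reference tokens learnable from the KB only.

    instances: list of (reference_tokens, kb_object_tokens) per test sample.
    kb_cooc / infobox_cooc: training-set co-occurrence counts between a
        reference token and its matched KB object / infobox token.
    """
    ratios = []
    for ref_tokens, kb_object_tokens in instances:
        if not ref_tokens:
            continue  # skip empty references to avoid division by zero
        category_c = 0
        for token in ref_tokens:
            if token in stop_words or token not in kb_object_tokens:
                continue  # keep only tokens that string-match a KB object
            from_kb = kb_cooc.get(token, 0) > gamma
            from_infobox = infobox_cooc.get(token, 0) > gamma
            if from_kb and not from_infobox:
                category_c += 1
        ratios.append(category_c / len(ref_tokens))
    return sum(ratios) / len(ratios) if ratios else 0.0


# Toy instance inspired by the Nacer Hammami example; counts are made up.
reference = ["nacer", "hammami", "is", "an", "algerian", "football", "player"]
score = kb_gain(
    instances=[(reference, {"algerian"})],
    kb_cooc={"algerian": 120},
    infobox_cooc={"algerian": 3},
    stop_words=frozenset({"is", "an"}),
)
```

In this toy case only "algerian" falls into category C, so the score is 1/7.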

Datasets
In the experiments, we adopt WikiBio (Lebret et al., 2016) along with twenty new infobox-to-text datasets collected from Wikipedia 5. The full statistics of these datasets can be found in Appendix A.2. Table 1 shows the statistics of two sample datasets. Each dataset consists of infoboxes as input data and the first sentences of their corresponding Wikipedia articles as references. For example, on the WikiBio and Album datasets, we extract 5.9 and 5.5 entities from each table, and each entity has 19.0 and 7.2 extended facts, respectively.
4 To keep things simple, we adopt strict string matching. This is acceptable for a rough quantification of the intended measurement, since precise and semantic-aware matching is still an active research area. We also tried several popular stemmers to expand tokens, e.g., Algeria to Algerian, but none of them has this capability.
5 These twenty datasets are similar to WikiBio but come from different domains, e.g., Album, Book, etc., which are characterized by the infobox template category name. They are created with a similar procedure and the same Wikipedia dumps as outlined in Lebret et al. (2016). For more details about data collection, please refer to Appendix A.1.

Experiment Setup
We conduct experiments on the datasets introduced in Section 4.1. Seq2Seq+Copy is the main baseline: a sequence-to-sequence (Seq2Seq) model equipped with the copy mechanism from See et al. (2017), which is one of the state-of-the-art methods. We also compare our results with other published results on the WikiBio dataset. The structure of our baseline model is most similar to that of Sha et al. (2017), with their specialized design for order planning removed, as it is not the focus of this paper. Since our aim is to verify the effectiveness of external knowledge for the data-to-text task, we keep our baseline model as general as possible, without other specialized designs. Our primary model is KBAtt: a model which integrates the background knowledge into the baseline model through a KB encoder and a KB attention mechanism. We employ BLEU (Papineni et al., 2002) as the automatic evaluation metric. In addition to BLEU, we conduct human evaluation to assess the factual accuracy of the generated sentences.

Training Details
The dimensions of all trainable word embeddings are set to 512, and the GRU hidden state sizes are set to 512. To limit the memory usage of our model, we set the maximum number of facts per table to 500. We initialize all model parameters randomly from a uniform distribution between -0.08 and 0.08. For training, we use Adam (Kingma and Ba, 2014) with an initial learning rate of 0.001 as the optimization algorithm. The training batch size is set to 64. We also apply gradient clipping (Pascanu et al., 2013) with range [-1, 1] during training. We conduct experiments on a single NVIDIA Tesla V100 GPU. The largest 15 datasets, with more than 9500 training instances each, are trained for 20 epochs, while the remaining 6 datasets are trained for 40 epochs. All models are selected based on the BLEU-4 score on the development set. All experiments use greedy search as the decoding algorithm during testing.
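For reference, the settings listed above can be collected into a small configuration sketch. The class and field names are hypothetical, and the clipping helper assumes that "range [-1, 1]" means element-wise value clipping rather than norm clipping, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as reported in the text; names are illustrative."""
    embedding_dim: int = 512
    hidden_size: int = 512
    max_facts_per_table: int = 500
    init_range: float = 0.08      # uniform initialization in [-0.08, 0.08]
    learning_rate: float = 0.001  # initial Adam learning rate
    batch_size: int = 64
    grad_clip: float = 1.0        # gradients clipped to [-1, 1]


def num_epochs(train_size):
    """Larger datasets (more than 9500 training instances) train for fewer epochs."""
    return 20 if train_size > 9500 else 40


def clip_gradients(grads, limit):
    """Clip every gradient value into [-limit, limit] (assumed element-wise)."""
    return [max(-limit, min(limit, g)) for g in grads]


config = TrainConfig()
clipped = clip_gradients([2.5, -0.3, -4.0], config.grad_clip)
```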

Overall Results
To verify the effectiveness of KBAtt, we conduct experiments on 21 Wikipedia infobox-to-text datasets. Table 2 shows the performance of KBAtt and Seq2Seq+Copy in terms of BLEU-4 score. As we can see, KBAtt outperforms Seq2Seq+Copy on most of the datasets, with improvements of more than 0.5 BLEU-4 on 15 out of 21 datasets and comparable results on the remaining 6. To get a better understanding of when and why external knowledge is effective, we correlate the performance gains in terms of BLEU-4 with the KBGain metric described in Section 3 across the 21 datasets, and plot their values in the scatter plot in Fig. 4. The Pearson correlation coefficient (ρ) between the BLEU-4 improvements and KBGain is 0.716, which indicates a strong positive correlation between them. This confirms our analysis in Section 3 that KBAtt is effective when the external knowledge is relevant and the original data is sparse.

Results on WikiBio
To compare with state-of-the-art models, Table 3 shows the results on the WikiBio dataset. Among them, our baseline Seq2Seq+Copy achieves a BLEU-4 score of 44.28, which is comparable to Sha et al. (2017). Our proposed model, KBAtt, obtains a BLEU-4 score of 44.59. Although the absolute improvement of KBAtt over Seq2Seq+Copy (+0.31 BLEU-4 points) is relatively small, the difference between them is statistically significant under the one-tailed paired t-test at the 99% significance level. The absolute improvement is relatively small because the full WikiBio dataset consists of 728,321 parallel data-to-text pairs, which are enough for neural models to memorize certain inference patterns for high-frequency pairs. However, as will be shown in Section 4.5, the baseline fails on pairs with low co-occurrence frequency, whereas KBAtt avoids this problem thanks to the external knowledge.

Human Evaluation
We conduct human evaluation to assess the factual accuracy of the generated sentences. Manually evaluating the generated results of all datasets is labour intensive, so we choose to evaluate two sample datasets for case study purposes. Specifically, we sample 50 instances each from the WikiBio and the Album datasets, and ask two annotators to extract fact tuples (subject, relation, object) from the references and the generated sentences. In Table 4, precision P 1, recall R 1 and their overall score F 1 measure the extent to which the facts extracted from the generated sentences conform to those from the references. As we can see, KBAtt achieves improvements of 1.21% and 7.42% in terms of F 1 on WikiBio and Album respectively, which shows that KBAtt can generate more relevant facts with respect to the reference than the Seq2Seq+Copy baseline. We further ask the annotators to judge whether the facts extracted from the generated texts are correct against information from the infobox, the external knowledge (e.g., Wikidata), or even search engines. In Table 4, P 2 measures the ratio of correct facts in the generated results. We observe that KBAtt improves over Seq2Seq+Copy by 3.50% and 6.26% on WikiBio and Album respectively. This shows that KBAtt is more likely to generate accurate facts than Seq2Seq+Copy.
Table 3: Results on the WikiBio dataset.
Model                                              BLEU-4
Table NLM (Lebret et al., 2016)                    34.70
Table2Seq (Bao et al., 2018)                       40.26
Order Planning, full model (Sha et al., 2017)      43.91
Field-gating Seq2Seq, full model                   44[...]
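The fact-level scores can be illustrated with a short sketch. It assumes that a generated fact counts as matched only when the identical (subject, relation, object) tuple appears among the reference facts, and that the overall score is the harmonic mean of precision and recall; in the actual evaluation the matching was done by human annotators. The example tuples are hypothetical.

```python
def fact_prf(generated_facts, reference_facts):
    """Precision, recall and F1 of generated fact tuples against the reference."""
    generated, reference = set(generated_facts), set(reference_facts)
    matched = len(generated & reference)
    precision = matched / len(generated) if generated else 0.0
    recall = matched / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical tuples for the Nacer Hammami example.
reference_facts = [
    ("Nacer Hammami", "nationality", "Algerian"),
    ("Nacer Hammami", "occupation", "football player"),
    ("Nacer Hammami", "club", "MC El Eulma"),
    ("MC El Eulma", "league", "Algerian Ligue Professionnelle 1"),
]
generated_facts = [
    ("Nacer Hammami", "nationality", "Algerian"),
    ("Nacer Hammami", "occupation", "football player"),
    ("Nacer Hammami", "club", "some other club"),  # an incorrect fact
]
p1, r1, f1 = fact_prf(generated_facts, reference_facts)
```

With two of three generated facts matching four reference facts, precision is 2/3, recall is 1/2 and F1 is 4/7.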

Analysis of Few-Shot Learning Ability
To examine the ability to learn writing knowledge from few examples, we design an experiment that compares the baseline and our model under different numbers of training samples. In the training set of WikiBio, about 78.5% of the tables contain the place of birth field, but only 19.0% include the nationality field. However, the nationality of a person is frequently mentioned in the references. This means that the ability to infer nationality from place of birth is important. We collect the cases from the test data that satisfy the following conditions: (a) only city-level information is given in the place of birth field; (b) nationality is not specified in the table; (c) the reference mentions a country name or nationality. From these cases, we get over 400 unique place-of-birth-to-nationality inference pairs and split them into two intervals, [1, 25) and [25, ∞), according to their co-occurrence frequency in the training set 8. For each interval, we randomly sample 50 test cases and manually assess the accuracy of the nationality information mentioned in the generated sentences. Fig. 5 shows the accuracy of the nationality information in the generated text with respect to the different co-occurrence frequency intervals. Firstly, we found that the baseline model struggles to learn the inference from place of birth to nationality when their co-occurrence frequency in the training set is less than 25, which shows the difficulty of this task. Secondly, in the interval [1, 25), the baseline model only gets 42.0% accuracy, whereas our model achieves 80.0% accuracy, a 38.0% absolute gain. This confirms our motivation that incorporating external knowledge into neural models can improve model performance, especially when the original data is sparse. Finally, the accuracy of both models increases as the frequency goes from [1, 25) to [25, ∞), and the improvement of our model over the baseline model narrows from 38.0% to 6.0%. This shows that the baseline model can learn part of the inference patterns when enough parallel data is available.
8 The interval threshold 25 is set by following that for KBGain.
Fig. 6 shows three examples from the development set of the WikiBio dataset which demonstrate how KBAtt succeeds or fails. Fig. 6 (a) and (b) illustrate that KBAtt is helpful when the original data is sparse, whereas when the original data is dense, the baseline model can still learn the pattern. As shown in Fig. 6 (a), the baseline model struggles to learn the direct association between "german" and "gummersbach" (the birth place in the table), since they only co-occur 12 times in the training data. But it is easy for KBAtt to learn, since "german" co-occurs with "germany" in the KB 12,437 times in the training data. As shown in Fig. 6 (b), although the nationality information is not explicitly stated in the table, both Seq2Seq+Copy and KBAtt generate the correct nationality information ("australian"). The reason is that "australian" co-occurs with "melbourne" in the table 3,391 times in the training data, so it is easy for the neural model to learn such an inference pattern. Fig. 6 (c) illustrates that the pattern of inferring nationality from birth place is not always correct. In this case, KBAtt makes a plausible inference of nationality from the birth place, so it generates "american". However, this inference pattern does not hold here, because the person is British. Fortunately, such cases are rare, since the birth place conforms to the nationality in most cases, so our method can bring improvements, as indicated by the overall better performance across almost all Wikipedia categories.
Figure 6: Case study. Three generation examples from the development set of WikiBio. Each example consists of the input infobox, parts of the external knowledge base, the reference and the two sentences generated by Seq2Seq+Copy and KBAtt. We mark correct fact information in blue and incorrect information in red.
(c) Reference: henry roberts ( 16 april 1803 -- 9 march 1876 ) was a british architect best known for fishmongers ' hall in london and for his work on model dwellings for workers .
(c) Seq2Seq+Copy: henry roberts ( 1803 -- 1876 ) was an english architect who designed many buildings in florence , florence , and florence .
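The interval-based comparison used in the few-shot analysis can be sketched as a simple bucketing step: each test case is assigned to [1, 25) or [25, ∞) by the training-set co-occurrence frequency of its inference pair, and accuracy is computed per bucket. The function name and the toy judgements below are hypothetical; in the study the correctness labels came from manual assessment.

```python
def accuracy_by_interval(cases, threshold=25):
    """Accuracy of generated nationality per co-occurrence interval.

    cases: list of (cooccurrence_frequency, is_correct) pairs, where the
        frequency is counted on the training set.
    """
    buckets = {"[1, 25)": [], "[25, inf)": []}
    for frequency, is_correct in cases:
        key = "[1, 25)" if frequency < threshold else "[25, inf)"
        buckets[key].append(is_correct)
    return {key: (sum(flags) / len(flags) if flags else None)
            for key, flags in buckets.items()}


# Toy judgements: two sparse pairs and two frequent pairs.
toy_cases = [(12, False), (3, True), (3391, True), (40, True)]
accuracy = accuracy_by_interval(toy_cases)
```

Running the same bucketing for both systems on the same sampled cases gives per-interval accuracies that can be compared directly.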

Related Work
With the advent of neural text generation, the distinction between content selection and surface realization has become blurred. For example, Mei et al. (2016) proposed an end-to-end encoder-aligner-decoder model that learns content selection and surface realization jointly and shows good results on the WeatherGov and RoboCup datasets. Wiseman et al. (2017) generate long descriptive game summaries from a database of basketball games, and show that current state-of-the-art neural models are quite good at generating fluent outputs but perform poorly in content selection and in capturing long-term structure. Our work falls into the task of single-sentence generation from Wikipedia infoboxes. The model structures for this task range from feed-forward networks (Lebret et al., 2016) to encoder-decoder models (Sha et al., 2017; Bao et al., 2018; Nema et al., 2018). Recently, Perez-Beltrachini and Lapata (2018) generalized this task to multi-sentence text generation, focusing on bootstrapping generators from loosely aligned data. However, most of the work mentioned above assumes that all the writing knowledge can be learned from massive parallel pairs of training data. Different from the previous work, we explore incorporating external knowledge into this task to improve the fidelity of the generated text.
Our work is also relevant to recent work on integrating external knowledge into neural models for other NLP tasks. The motivations for incorporating external knowledge range from enriching the context information in reading comprehension (Mihaylov and Frank, 2018) and improving the inference ability of models in natural language inference (Chen et al., 2018) to providing the model with a knowledge source to copy from in language modelling (Ahn et al., 2016). Our model, KBAtt, is most relevant to Mihaylov and Frank (2018), who focus on similarity calculation, whereas we focus on generation. Moreover, in addition to demonstrating the positive effect of incorporating external knowledge as in previous work, we also design a new metric to quantify the potential gains of external knowledge for a specific dataset, which can explain when and why our model is effective.

Conclusion
In this paper, we propose a neural data-to-text generation model, KBAtt, that incorporates external background knowledge in a simple but effective way to improve the fidelity of the generated text. Experiments on 21 Wikipedia infobox-to-text datasets show that KBAtt achieves better BLEU scores than a state-of-the-art baseline on most of the datasets. Meanwhile, to quantify when and why external knowledge is effective, we design a metric, KBGain, which shows a strong correlation with the observed performance boost. This result indicates that the relevance of external knowledge and the sparseness of the original data are the main factors affecting the effectiveness of KBAtt.
In the future, we plan to investigate integrating the multi-hop knowledge graph behind the data, which has the potential to further improve the inference ability of neural models. This will be especially worthwhile when we extend the task to multi-sentence generation. The main challenge in integrating a multi-hop knowledge graph is the large search space. We plan to employ reinforcement learning based techniques to allow the model to search for the optimal inference paths by trial and error. Besides, we are also interested in integrating external knowledge into other types of datasets beyond Wikipedia infobox-to-text datasets.

[...] (Manning et al., 2014). Different from Lebret et al. (2016), we keep the capitalization information, which we believe is important for real-world applications. To augment the infoboxes with external knowledge, we preserve the Wikipedia internal hyperlink information in the field values of the infoboxes, and follow these hyperlinks to get their corresponding entities from Wikidata (dump version: 20150831). Finally, we retrieve one-hop facts from Wikidata. The dataset is available at https://github.com/hitercs/WikiInfo2Text.
[...] in Wikipedia. However, the sizes of the remaining datasets range from 4,559 to 327,863, and more than 60% of them have fewer than 20,000 instances. So, an abundance of data such as in WikiBio is not common among the Wikipedia infobox-to-text datasets. Meanwhile, these datasets vary in sentence length and in the number of tokens in the table. In addition, the number of entities extracted from each table ranges from 0.8 to 10.1, and the average number of fact tuples per entity ranges from 5.2 to 126.1.
The threshold γ in the KBGain metric denotes the minimum co-occurrence count above which learning will be effective. We tune this threshold by finding the maximum Pearson correlation coefficient (ρ) between the BLEU-4 improvements and the values of KBGain on the development sets of the 21 datasets. Fig. 7 shows the curve of the Pearson correlation coefficient with respect to the co-occurrence frequency threshold, which ranges from 5 to 100. As we can see, ρ reaches its peak value of 0.753 when the threshold is 25. So γ is set to 25 throughout this paper.
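The threshold selection described above amounts to a one-dimensional sweep: for each candidate γ, compute KBGain on every dataset, correlate those values with the BLEU-4 improvements, and keep the γ with the highest Pearson coefficient. The sketch below assumes the per-dataset KBGain values have already been computed for each candidate; the helper names and all numbers are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (non-zero variance).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def tune_gamma(bleu_gains, kbgain_by_gamma):
    """Pick the threshold whose KBGain values correlate best with BLEU gains.

    kbgain_by_gamma: dict mapping a candidate gamma to the list of KBGain
        values, one per dataset, in the same order as bleu_gains.
    """
    scored = {gamma: pearson(bleu_gains, values)
              for gamma, values in kbgain_by_gamma.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Toy data for four datasets and three candidate thresholds.
bleu_gains = [0.1, 0.5, 1.0, 2.0]
kbgain_by_gamma = {
    5: [0.03, 0.01, 0.04, 0.02],
    25: [0.01, 0.02, 0.03, 0.05],
    100: [0.02, 0.02, 0.01, 0.03],
}
best_gamma, best_rho = tune_gamma(bleu_gains, kbgain_by_gamma)
```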