Describing a Knowledge Base

We aim to automatically generate natural language descriptions of an input structured knowledge base (KB). We build our generation framework on a pointer network which can copy facts from the input KB, and add two attention mechanisms: (i) slot-aware attention to capture the association between a slot type and its corresponding slot value; and (ii) a new table position self-attention to capture the inter-dependencies among related slots. For evaluation, besides standard metrics including BLEU, METEOR, and ROUGE, we propose a KB reconstruction based metric by extracting a KB from the generation output and comparing it with the input KB. We also create a new dataset which includes 106,216 pairs of structured KBs and their corresponding natural language descriptions for two distinct entity types. Experiments show that our approach significantly outperforms state-of-the-art methods. The reconstructed KBs achieve 68.8%-72.6% F-score.


Introduction
Show and tell, showing an audience something and telling them about it, is a common classroom activity for early elementary school kids. As a similar practice for knowledge propagation, we often need to describe and/or explain the information in a structured knowledge base (KB) in natural language, in order to make the knowledge elements and their connections easier to comprehend.
For example, (Cawsey et al., 1997) presents a natural language generation system to convert structured medical records to natural language text descriptions, which enables more effective communication between health care providers and their patients and among health care providers themselves.
Moreover, 51% of entity attributes in the current English Wikipedia infoboxes are not described in the English articles of the Wikipedia dump of April 1, 2018. The availability of vast amounts of Linked Open Data (LOD) and Wikipedia-derived resources such as DBPedia, WikiData and YAGO encourages pursuing a new direction of knowledge-driven (Lu et al., 2018) or semantically oriented (Bouayad-Agha et al., 2013) Natural Language Generation (NLG). We aim to fill in this knowledge gap by developing a system that takes a KB (consisting of a set of slot types and their values) about an entity as input (see example in Table 1), and automatically generates a natural language description.
Neural generation to generalize linguistic expressions. One major challenge lies in generalizing the wide variety of expressions, patterns, templates and styles which humans use to describe the same slot type. For example, to describe a football player's membership with a team, we can use various phrases including member of, traded to, drafted by, played for, face of, loaned to and signed for. Instead of manually crafting patterns for each slot type, we leverage the existing pairs of structured slots from Wikipedia infoboxes and Wikidata (Vrandečić and Krötzsch, 2014) and the corresponding sentences describing these slots in Wikipedia articles as our training data, to learn a deep neural network based generator.
Pointer network to copy over facts. The previous work  considers the slot type and slot value as two sequences and applies a sequence to sequence (seq2seq) framework (Cho et al., 2014) for generation. However, the task of describing structured knowledge is fundamentally different from creative writing, because we need to cover the knowledge elements contained in the input KB, and the goal of generation is mainly to clearly describe the semantic connections among these knowledge elements in an accurate and coherent way. The seq2seq model fails to capture such connections and tends to generate wrong information (e.g., Thailand in Table 2). To address this challenge, we choose a pointer network (See et al., 2017) to copy slot values directly from the input KB.
Slot type attention. However, the copying mechanism in the pointer network is not able to capture the alignment between a slot type and its slot value, and thus it often assigns facts to wrong slots. For example, 22 in Table 2 should be the number of matches instead of the birth date. It also tends to repeat the same slot value based on the language model, e.g., "Uroplatus ebenaui is a of gecko endemic to Madagascar. The Uroplatus is a member of the species of the genus Madagascar.". We propose a Slot-aware Attention mechanism to compute slot type attention and slot value attention simultaneously and capture their correlation. The attention mechanism in deep neural networks (Denil et al., 2012) is inspired by human visual attention, which refers to humans' capability to focus on a certain region of an image with high resolution while perceiving the surrounding image in low resolution. It allows the neural network to access the hidden states of the encoder, and thus learn what to attend to. For example, for a Date of Birth slot type, words such as born may receive higher attention than female. As we can see in Table 2 (+Type), the output with slot type attention contains more precise slots.
Table position attention. Multiple slots are often interdependent. For example, a football player may join multiple teams, with each team associated with a certain number of points, goals, scores and games participated. We design a new table position based self-attention to capture correlations among interdependent slots and put them in the same sentence. For example, our model successfully associates the number of matches 22 with the Israel women's national football team as shown in Table 2.
The major contributions of this paper are:
• For the first time, we propose a new table position attention which proves to be effective at capturing inter-dependencies among facts. This new approach achieves a 2.5%-7.8% F-score gain at KB reconstruction.
• We propose a KB reconstruction based metric to evaluate how many facts are correctly expressed in the generation output.
• We create a large dataset of KBs paired with natural language descriptions for 106,216 entities, which can serve as a new benchmark.

Model
We formulate the input structured KB to the model as a list of triples $L = [(s_1, v_1, (r_1, \bar{r}_1)), \ldots, (s_n, v_n, (r_n, \bar{r}_n))]$, where $s_i$ denotes a slot type (e.g., Country of Citizenship), $v_i$ denotes the corresponding slot value (e.g., Israel), and $(r_i, \bar{r}_i)$ denotes the position of the triple in the input list and consists of the forward position $r_i$ and the backward position $\bar{r}_i = n - r_i + 1$. The outcome of the model is a paragraph $Y = [y_1, y_2, \ldots, y_m]$. The training instances for the generator are provided in the form of pairs $(L, Y)$ of an input KB and its ground-truth description.
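To make the input formulation concrete, the following minimal Python sketch (not the authors' code; the function name and the example slots are illustrative) builds the triple list $L$ with forward and backward positions from a list of (slot type, slot value) pairs:

def build_input_triples(slots):
    """slots: list of (slot_type, slot_value) pairs taken from an infobox/Wikidata entry."""
    n = len(slots)
    return [(s, v, (r, n - r + 1)) for r, (s, v) in enumerate(slots, start=1)]

kb = [("Name", "Silvi Jan"),
      ("Country of Citizenship", "Israel"),
      ("Member of sports team", "Israel women's national football team")]
print(build_input_triples(kb))
# [('Name', 'Silvi Jan', (1, 3)), ('Country of Citizenship', 'Israel', (2, 3)),
#  ('Member of sports team', "Israel women's national football team", (3, 1))]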

Sequence-to-Sequence with Slot-aware Attention
Following previous studies on describing structured knowledge (Lebret et al., 2016), we apply a sequence-to-sequence based approach and incorporate a slot-aware attention to generate the descriptions.
Encoder. Given a structured KB input $L = [(s_1, v_1, (r_1, \bar{r}_1)), \ldots, (s_n, v_n, (r_n, \bar{r}_n))]$, where $s_i$, $v_i$, $r_i$, $\bar{r}_i$ are randomly embedded as vectors $\mathbf{s}_i$, $\mathbf{v}_i$, $\mathbf{r}_i$, $\bar{\mathbf{r}}_i$ respectively, we concatenate the vector representations of these fields as $l_i = [\mathbf{s}_i; \mathbf{v}_i; \mathbf{r}_i; \bar{\mathbf{r}}_i]$, and obtain $L = [l_1, l_2, \ldots, l_n]$. We attempted to apply the average of $L$ as the representation of the input KB. However, such a flat representation fails to capture the structured contextual information in the entire KB. Therefore, we apply a bi-directional Gated Recurrent Unit (GRU) encoder (Cho et al., 2014) on $L$ to produce the encoder hidden states $H = [h_1, h_2, \ldots, h_n]$, where $h_i$ is the hidden state for $l_i$.
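The encoder can be sketched as follows (a minimal PyTorch sketch under assumed embedding and hidden dimensions, not the authors' implementation; class and argument names are illustrative):

import torch
import torch.nn as nn

class KBEncoder(nn.Module):
    """Embed slot type, slot value and forward/backward positions, concatenate them per
    triple, and run a bidirectional GRU over the triple sequence (Cho et al., 2014)."""

    def __init__(self, n_types, n_values, n_positions, d_emb=64, d_hidden=128):
        super().__init__()
        self.type_emb = nn.Embedding(n_types, d_emb)
        self.value_emb = nn.Embedding(n_values, d_emb)
        self.pos_emb = nn.Embedding(n_positions, d_emb)   # shared for r and r-bar
        self.gru = nn.GRU(4 * d_emb, d_hidden, bidirectional=True, batch_first=True)

    def forward(self, types, values, fwd_pos, bwd_pos):
        # each argument: LongTensor of shape (batch, n_triples)
        l = torch.cat([self.type_emb(types), self.value_emb(values),
                       self.pos_emb(fwd_pos), self.pos_emb(bwd_pos)], dim=-1)
        H, _ = self.gru(l)   # encoder hidden states, shape (batch, n_triples, 2 * d_hidden)
        return H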
Decoder with Slot-aware Attention. The decoder is a forward GRU network with an initial hidden state $h_n$, which is the encoder hidden state of the last triple. In order to capture the association between a slot type and its slot value, we design a Slot-aware Attention. At each step $t$, we compute an attention distribution over the sequence of input triples. For each triple $i$, we assign it an attention weight
$e^t_i = \nu^\top \tanh\left(W_s \mathbf{s}_i + W_v \mathbf{v}_i + W_h \tilde{h}_t + w_c\, c^t_i + b\right), \quad \alpha^t = \mathrm{softmax}(e^t),$
where $\tilde{h}_t$ is the decoder hidden state at step $t$, $\mathbf{s}_i$ and $\mathbf{v}_i$ denote the embedding representations of slot type $s_i$ and slot value $v_i$ respectively, and $W_s$, $W_v$, $W_h$, $w_c$, $\nu$, $b$ are learnable parameters. $c^t_i = \sum_{k=0}^{t-1} \alpha^k_i$ is a coverage vector, which is the sum of attention distributions over all previous decoder time steps and can be used to reduce repetition (See et al., 2017).
The source attention distribution $\alpha^t$ can be considered as the contribution of each source triple to the generation of the target word. Next we use $\alpha^t$ to compute two context vectors, $L^*_s = \sum_{i=1}^{n} \alpha^t_i \mathbf{s}_i$ and $L^*_v = \sum_{i=1}^{n} \alpha^t_i \mathbf{v}_i$, as the representations of the slot types and values respectively. At step $t$, the vocabulary distribution $P_{vocab}$ is computed from the context vectors $L^*_s$, $L^*_v$ and the decoder hidden state $\tilde{h}_t$, using an affine-softmax layer:
$P_{vocab} = \mathrm{softmax}\left(W_o [\tilde{h}_t; L^*_s; L^*_v] + b_o\right).$
The loss is the negative log-likelihood of the ground-truth tokens combined with the coverage loss of See et al. (2017):
$\mathrm{loss} = \sum_{t} \left( -\log P_{vocab}(y_t) + \lambda \sum_{i} \min(\alpha^t_i, c^t_i) \right),$
where $P_{vocab}(y_t)$ is the prediction probability of the ground-truth token $y_t$ and $\lambda$ is a hyperparameter.
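A minimal PyTorch sketch of the slot-aware attention follows; the additive scoring with a coverage term is an assumption modeled on See et al. (2017), and the exact parameterization in our implementation may differ:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotAwareAttention(nn.Module):
    """Compute one attention distribution over input triples and return the two context
    vectors L*_s (over slot types) and L*_v (over slot values)."""

    def __init__(self, d_emb, d_dec):
        super().__init__()
        self.w_s = nn.Linear(d_emb, d_dec, bias=False)   # slot type term
        self.w_v = nn.Linear(d_emb, d_dec, bias=False)   # slot value term
        self.w_h = nn.Linear(d_dec, d_dec, bias=False)   # decoder state term
        self.w_c = nn.Linear(1, d_dec, bias=False)       # coverage term
        self.score = nn.Linear(d_dec, 1, bias=False)

    def forward(self, s, v, h_dec, coverage):
        # s, v: (batch, n, d_emb) slot type / value embeddings
        # h_dec: (batch, d_dec) decoder state; coverage: (batch, n) summed past attention
        e = self.score(torch.tanh(self.w_s(s) + self.w_v(v)
                                  + self.w_h(h_dec).unsqueeze(1)
                                  + self.w_c(coverage.unsqueeze(-1)))).squeeze(-1)
        alpha = F.softmax(e, dim=-1)                          # attention over triples
        ctx_s = torch.bmm(alpha.unsqueeze(1), s).squeeze(1)   # L*_s
        ctx_v = torch.bmm(alpha.unsqueeze(1), v).squeeze(1)   # L*_v
        return alpha, ctx_s, ctx_v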

Table Position Self-attention
Although the sequence-to-sequence attention model takes into account the information of input triples, it still encodes the structured knowledge as sequential facts while ignoring the correlations between facts.
In our task, multiple interdependent slots should be described within one sentence. For example, in Table 1, the sport team Israel women's national football team should be described together with 22 matches and 29 goals. Previous studies (Lin et al., 2017; Vaswani et al., 2017) applied self-attention at the sentence level to capture the correlations between tokens. Inspired by these approaches, we design a new table position based self-attention and incorporate it into the slot-aware attention.
In our task, since most triples are organized in temporal order, we use the row index $r$ and the reverse row index $\bar{r}$ to denote the position information of each triple in the input KB. Given a structured KB input $L = [(s_1, v_1, (r_1, \bar{r}_1)), \ldots, (s_n, v_n, (r_n, \bar{r}_n))]$, we obtain a sequence of row index embeddings $R = [r_1, r_2, \ldots, r_n]$ with random initialization, where $r_i = [\mathbf{r}_i; \bar{\mathbf{r}}_i]$. We model the inter-dependencies among slots as a latent structure, where each position $i$ is assumed to have a latent in-link and out-link that denote which positions it is linked to or from. This assumption is similar to the structured attention applied in Liu and Lapata (2018), which assumes each word within a sentence can be a parent node or a child node in a latent tree structure. For each pair of slots $i$ and $j$, we compute an attention score $f_{ij}$ from their in-link and out-link representations, where $W_{in}$, $W_{out}$, and $W_g$ are learnable parameters. The attention scores do not change during the decoding process.
$f_{ij}$ can be viewed as the contribution from a context triple $j$ to triple $i$. For each slot type $s_i$ and value $v_i$, we obtain a context vector by collecting information from the other slot types and their values according to the normalized scores $f_{ij}$. We further encode a position-aware representation of each slot type and value, and use it to update the context vectors $L^*_s$ and $L^*_v$ defined above.
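The table position self-attention can be sketched as below; the bilinear in-link/out-link scoring is an assumption modeled on the structured attention of Liu and Lapata (2018), so the exact form of $f_{ij}$ in our model may differ:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TablePositionSelfAttention(nn.Module):
    """Score every pair of table positions from their row-index embeddings and collect a
    context vector for each slot type and value from the other slots. The scores depend
    only on the position embeddings, so they stay fixed during decoding."""

    def __init__(self, d_pos, d_latent):
        super().__init__()
        self.w_in = nn.Linear(d_pos, d_latent, bias=False)    # latent in-link
        self.w_out = nn.Linear(d_pos, d_latent, bias=False)   # latent out-link
        self.w_g = nn.Linear(d_latent, d_latent, bias=False)

    def forward(self, r, s, v):
        # r: (batch, n, d_pos) row-position embeddings [r_i ; r_bar_i]
        # s, v: (batch, n, d_emb) slot type / value embeddings
        inlink = torch.tanh(self.w_in(r))
        outlink = torch.tanh(self.w_out(r))
        f = torch.bmm(self.w_g(inlink), outlink.transpose(1, 2))   # (batch, n, n) scores
        a = F.softmax(f, dim=-1)
        g_s = torch.bmm(a, s)   # position-aware context per slot type
        g_v = torch.bmm(a, v)   # position-aware context per slot value
        return g_s, g_v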

Structure Generator
Traditional sequence-to-sequence models predict a target sequence by only selecting words from a vocabulary of fixed size. However, in our task we regard each slot value as a single information unit. Therefore, there is a certain amount of out-of-vocabulary (OOV) words during the test phase. Inspired by the pointer-generator (Gu et al., 2016; See et al., 2017), which is designed to automatically locate particular source words and directly copy them into the target sequence, we design a structure-aware generator as follows.
We first obtain a source attention distribution over all unique input slot values. Since one particular slot value may occur in the structured input multiple times, we aggregate the attention weights for each unique slot value $v_j$ from $\alpha^t$ and obtain its aggregated source attention distribution $P^j_{source} = \sum_{i: v_i = v_j} \alpha^t_i$.
Gates in neural networks act on the signals they receive, blocking or passing on information based on its strength. In order to combine the two distributions $P_{source}$ and $P_{vocab}$, we compute a structure-aware gate $p_{gen} \in [0, 1]$ as a soft switch between generating a word from the fixed vocabulary and copying a slot value from the structured input, where the gate is a sigmoid function $\sigma$ of the current decoding context and the embedding $y_{t-1}$ of the previously generated token at time $t-1$.
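As a minimal sketch (illustrative names, not the authors' code), the aggregation of copy probabilities over repeated slot values can be written as a scatter-add, and the structure-aware gate as a sigmoid over the current decoding context (the exact inputs to the gate are an assumption):

import torch

def aggregate_copy_distribution(alpha, value_ids, ext_vocab_size):
    """alpha: (batch, n) slot-aware attention over triples; value_ids: (batch, n) index of
    each triple's slot value in an extended vocabulary. Attention weights of triples that
    share the same slot value are summed into one copy probability P_source."""
    p_source = torch.zeros(alpha.size(0), ext_vocab_size, dtype=alpha.dtype,
                           device=alpha.device)
    return p_source.scatter_add(1, value_ids, alpha)

# Assumed gate: p_gen = sigmoid(linear([h_dec; ctx_s; ctx_v; y_prev])), where y_prev is
# the embedding of the previously generated token.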
The final probability of a token $y_t$ at time $t$ is computed from $p_{gen}$, $P_{vocab}$ and $P_{source}$:
$P(y_t) = p_{gen} P_{vocab}(y_t) + (1 - p_{gen}) P_{source}(y_t).$
The loss function, combined with the coverage loss (See et al., 2017), is:
$\mathrm{loss} = \sum_t \left( -\log P(y_t) + \lambda \sum_i \min(\alpha^t_i, c^t_i) \right),$
where $P(y_t)$ is the prediction probability of the ground-truth token $y_t$ and $\lambda$ is a hyperparameter.

Dataset Construction
(6). Build a fixed vocabulary for the whole corpus of ground-truth descriptions and label the words with frequency < 5 as OOV. We further randomly shuffle and split the dataset into training (80%), development (10%) and test (10%) subsets for person and animal entities respectively. Table 3 shows the detailed statistics. Compared with the Wikibio dataset used in previous studies (Lebret et al., 2016), which contains only one sentence as the ground-truth description, our dataset contains multiple sentences to cover as many facts as possible in the input structured KB.
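The vocabulary and data split step can be sketched as follows (a minimal Python sketch; the function name, random seed and whitespace tokenization are illustrative assumptions):

import random
from collections import Counter

def build_vocab_and_splits(descriptions, entities, min_freq=5, seed=0):
    """Keep words with corpus frequency >= min_freq in the fixed vocabulary (others are
    treated as OOV), then shuffle entities and split them 80/10/10 into train/dev/test."""
    counts = Counter(w for d in descriptions for w in d.split())
    vocab = {w for w, c in counts.items() if c >= min_freq}
    random.Random(seed).shuffle(entities)
    n = len(entities)
    train = entities[: int(0.8 * n)]
    dev = entities[int(0.8 * n): int(0.9 * n)]
    test = entities[int(0.9 * n):]
    return vocab, train, dev, test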

Evaluation Metrics
We apply the standard BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014), and ROUGE (Lin, 2004) metrics to evaluate the generation performance, because they measure the content overlap between the system output and the ground truth and also check whether the system output is written in sufficiently good English. In addition, we can consider natural language as the most expressive way of transmitting knowledge through a noisy channel: if we are able to reconstruct the input KB from the generated description, our generator achieves a 100% success rate at knowledge propagation. We therefore propose a KB reconstruction based metric as follows: for each entity, construct a KB from the generated paragraph, and compute precision, recall and F-score by comparing it with the input KB from two aspects: (1). Overall Slot Filling: if a pair of slot type and slot value exists in both the reconstructed KB and the input KB, it is considered a correct slot. (2). Inter-dependent Slot Filling: if a row that consists of one or multiple slot types and their slot values exists in both the reconstructed KB and the input KB, it is considered a correct row.
If the same slot/row is correctly described multiple times in the system generation output, it is only counted as correct once, i.e., redundant descriptions are penalized. This metric is further illustrated in Figure 2. It is similar to the relation extraction based generation evaluation metric proposed by Wiseman et al. (2017) and the entity/event extraction based metric proposed by Lu et al. (2018), who compared automatic Information Extraction results from the reference description and the system generation output. However, the performance of state-of-the-art open-domain slot filling (Wu and Weld, 2010; Fader et al., 2011; Min et al., 2012; Xu et al., 2013; Angeli et al., 2015; Bhutani et al., 2016) is still far from satisfactory to serve as an automatic extraction tool for evaluating generation results. Therefore, for the pilot study in this paper we manually reconstruct KBs from the generation output for evaluation. Notably, none of the above automatic metrics is sufficient to capture the adequacy, grammaticality and fluency of the generated descriptions. However, extrinsic metrics such as system purpose and user task are expensive, while cheaper metrics such as human rating do not correlate with extrinsic metrics (Gkatzia and Mahamood, 2015). Moreover, the task we address in this paper requires essential domain knowledge for a human user to assess the generated descriptions.
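The slot-level score can be sketched as follows (a minimal Python sketch of the metric described above; the row-level score is computed analogously over rows instead of (slot type, slot value) pairs):

def kb_reconstruction_scores(input_kb, reconstructed_kb):
    """input_kb, reconstructed_kb: iterables of (slot_type, slot_value) pairs. Using sets
    means a slot described multiple times in the output is only counted once, so
    redundant descriptions are penalized."""
    gold = set(input_kb)
    pred = set(reconstructed_kb)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1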

Baseline Models
We compare our approach with the following models: (1). Seq2seq attention model (Bahdanau et al., 2015): we concatenate slot types and values as a sequence, e.g., {Name, Silvi Jan, Sports team, ASA Tel Aviv University, Hapoel Tel Aviv F.C., ...} for Table 1, and apply the sequence-to-sequence with attention model to generate a description. (2). Pointer-generator (See et al., 2017), which introduces a soft switch to choose between generating a word from the fixed vocabulary and copying a word from the input sequence; here we concatenate all slot values as the input sequence, e.g., {Silvi Jan, ASA Tel Aviv University, Hapoel Tel Aviv F.C., ...} for Table 1. (3). Pointer-generator + slot type attention, which incorporates the slot type attention (Section 2.1) into the pointer-generator; we use the sequence of (slot type, slot value) pairs as input, e.g., {(Name, Silvi Jan), (Sports team, ASA Tel Aviv University), (Sports team, Hapoel Tel Aviv F.C.), ...} for Table 1.
Table 4 shows the hyperparameters of our model. Table 5 shows the performance of the various models under the standard metrics. We can see that our attention mechanisms achieve consistent improvement. We conduct a paired t-test between our proposed model and all the other baselines on 10 randomly sampled subsets. The differences are statistically significant with p ≤ 0.016 for all settings. As shown in Table 6 and Table 7, the KBs reconstructed from models with these two attention mechanisms achieve much higher quality. Figure 3 and Figure 4 visualize the attention applied to the walk-through example in Table 1.
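For clarity, the three baseline input formats described above can be sketched as follows (the function name is illustrative; triples are the (slot type, slot value, position) tuples from Section 2):

def linearize_for_baselines(triples):
    """(1) seq2seq: slot types and values concatenated into one sequence;
    (2) pointer-generator: slot values only;
    (3) pointer-generator + slot type attention: (slot type, slot value) pairs."""
    seq2seq_input = [tok for s, v, _ in triples for tok in (s, v)]
    pg_input = [v for _, v, _ in triples]
    typed_input = [(s, v) for s, v, _ in triples]
    return seq2seq_input, pg_input, typed_input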

Results and Analysis
Impact of Slot-aware Attention. The same string can be filled into slots of multiple types. For example, dates, ages, the numbers of matches and goals can all be presented as numbers. The pointer network often mistakenly mixes them up. For example, it produces "24 September 1979 was born 3 October 1903 in 17 on 33 October 1906", where 33 should be the number of matches and 17 should be the number of goals. In contrast, our model with slot type attention correctly generates "he made 33 appearances and scored 17 goals". In addition, as mentioned earlier, the pointer network often produces redundant slot values because it loses control of slot types, e.g., "He was born in the city of Association football. In the late 1990s he was appointed manager of the Association football team of the team.".

Figure 3: Slot Type Attention Visualization. (Context words strongly associated with certain slot types receive high weights, e.g., capped to describe member of sports team, and times to describe the number of matches played.)

Impact of Table Position Attention. The table position attention successfully captures interdependent slots, such as a membership with a sports team and its corresponding numbers of matches and games: "Bill Sampy ... who played for Sheffield United F.C. 41 times."; "Giancarlo Antognoni ... he was also a member of the Italy national football team at the 1982 FIFA World Cup.".

Remaining Challenges. Some remaining errors are trivial to fix, such as changing a country name to its adjective form when it appears right before a position slot (e.g., Italian professional Association football player instead of Italy professional Association football player). The KB reconstruction recall of person entities is relatively low mainly because we do not have enough training data for some rare slot types.

Table 7: Inter-dependent Slot Filling Precision (P), Recall (R), F-score (F1) (%)
Contextual words generated by the language model introduce some incorrect facts, especially temporal expressions. For example, the generator does not have the commonsense knowledge that football players could not play before they were born: "Aleksei Gasilin ( born 1 March 1996 ) is a Russian Association football Forward (association football). He made his professional debut in the Russian Second Division in 1992 for Russia national under-19 football team.". Similarly, a football player would probably not still be active at 72 years old: "Basil Rigg ( born 12 August 1926 ) is a former Australian rules football Rigg played for the Perth Football Club in the Western Australia cricket team from 1998 to 1998.".
Our approach sometimes fails to detect a person's gender and thus generates incorrect pronouns. For animal entities, human writers are able to elaborate more details. For example, a human writer specifies the endemic places for the Brown treecreeper: "The bird endemic to eastern Australia has a broad distribution occupying areas from Cape York Queensland throughout New South Wales and Victoria to Port Augusta and the Flinders Ranges South Australia.", while our system is only able to cover the generic location information "It is endemic to Australia." from the input KB.

Related work
Our task is similar to the WebNLG challenge of generating text from DBPedia data (Gardent et al., 2017a). Previous approaches to generating natural language sentences from a structured input KB can be divided into two categories: the first is to induce templates and then fill appropriate content into the slots (Kukich, 1983; Cawsey et al., 1997; Angeli et al., 2010; Duma and Klein, 2013; Konstas and Lapata, 2013a; Flanigan et al., 2016a). These methods can generate high-quality descriptions but heavily rely on information redundancy to create templates. The second category is to directly generate a sequence of words using a language model (Belz, 2008; Chen and Mooney, 2008; Liang et al., 2009; Angeli et al., 2010; Konstas and Lapata, 2012a,b, 2013a,b; Mahapatra et al., 2016) or deep neural networks (Sutskever et al., 2011; Wen et al., 2015; Kiddon et al., 2016; Mei et al., 2016; Gardent et al., 2017b; Wiseman et al., 2017; Song et al., 2018). Several studies (Lebret et al., 2016; Chisholm et al., 2017; Kaffee et al., 2018a,b) generate a person's biography from an input structure, which is closely related to our task. However, instead of modeling the input structure as a sequence of facts and generating only one sentence, we introduce a table position self-attention, inspired by structured attention (Lin et al., 2017; Kim et al., 2017; Vaswani et al., 2017; Shen et al., 2018a,b), to capture the dependencies among facts and generate a paragraph describing all facts.
In contrast to some recent work on converting structured Abstract Meaning Representations (Banarescu et al., 2013) into natural language (Pourdamghani et al., 2016; Flanigan et al., 2016b), our task requires capturing inter-dependent relation links in a knowledge base and using them to generate, in most cases, multiple sentences. Our work is also related to attention mechanisms for sequence-to-sequence generation (Bahdanau et al., 2015; Mei et al., 2016; Ma et al., 2017). Different from previous studies, our task requires the slot type and slot value to appear in the generated sentences in pairs. Thus we design a slot-aware attention to obtain two context vectors for the slot type and slot value simultaneously. To deal with OOV words, we use a structure generator, which is similar to pointer-generator networks (Luong et al., 2015; Gulcehre et al., 2016; See et al., 2017) and the copy mechanism (Gu et al., 2016).

Conclusions and Future Work
We develop an effective generator to produce a natural language description of an input knowledge base. Our experiments show that two attention mechanisms focusing on slot type and table position advance the state of the art on this task, and provide a KB reconstruction F-score of up to 73%. We propose a new KB reconstruction based evaluation metric which can be used for other knowledge-driven NLG tasks such as news image/video captioning. In the future, we aim to address the remaining challenges summarized in Section 3.5, and to tackle the setting where multiple facts of the same slot type are not presented in temporal order in the input KB. We also plan to extend the framework to cross-lingual cross-media generation, namely producing a foreign-language description or an image/video about the KB.