Dynamic Topic Tracker for KB-to-Text Generation

Recently, many KB-to-text generation tasks have been proposed to bridge the gap between knowledge bases and natural language by directly converting a group of knowledge base triples into human-readable sentences. However, most of the existing models suffer from the off-topic problem, namely, the models are prone to generate some unrelated clauses that are somehow involved with certain input terms regardless of the given input data. This problem seriously degrades the quality of the generation results. In this paper, we propose a novel dynamic topic tracker for solving this problem. Different from existing models, our proposed model learns a global hidden representation for topics and recognizes the corresponding topic during each generation step. The recognized topic is used as additional information to guide the generation process and thus alleviates the off-topic problem. The experimental results show that our proposed model can enhance the performance of sentence generation and the off-topic problem is significantly mitigated.


Introduction
In recent years, many knowledge bases (KBs) have been built to store different kinds of human knowledge in a structured triple representation, such as Freebase (Bollacker et al., 2008), DBpedia (Auer et al., 2007), YAGO (Suchanek et al., 2007) and Wikidata (Vrandečić and Krötzsch, 2014). Many tasks, including question answering and recommendation systems, have benefited from KBs (Wang et al., 2017) as external knowledge sources. Although KBs have been very successful in supporting and improving various text mining tasks, they remain hard for humans to read because of their rigid structured format. A list of triples is not easily understandable, especially to people who are unfamiliar with KBs. To address this problem, researchers have recently proposed the KB-to-text generation task (Lebret et al., 2016; Gardent et al., 2017a; Gardent et al., 2017b) to bridge the gap between KBs and natural language. The task aims at directly converting a group of KB triples into human-readable sentences. For example, given the triple group (⟨Bill Gates, BirthPlace, Seattle⟩, ⟨Bill Gates, FounderOf, Microsoft⟩), the goal is to generate a comprehensible sentence such as "Bill Gates, the founder of Microsoft, was born in Seattle." Some works employ techniques from the text generation area (Gatt and Krahmer, 2018) to tackle the KB-to-text problem. Though these models have achieved some success, they still have many limitations. One major drawback is that most of them suffer from the off-topic problem. Consider the example given in Figure 1: the topic of the target sentence is expected to change from "person" to "company" during the generation process. However, a model is prone to generate unrelated off-topic clauses like "Bill is a commonly used name in the USA.", which are not consistent with the given input data. We refer to this phenomenon as the off-topic problem. It arises because, during the training stage, the models associate this kind of information with some input words like "Bill". In the testing or operational stage, when these words occur in the given data, the models are prone to generate off-topic sentences related to these words regardless of the given data.
Figure 1: The off-topic problem. During the generation process, the given data ranges from the topic of person to the topic of company. However, the models are prone to generate off-topic sentences just because the models associate this kind of information with the word "Bill".
To solve the off-topic problem, we propose to use topic information as a clue in the sentence generation process. Unfortunately, the corresponding topic information is not available, and no existing dataset contains topic annotations. It is therefore difficult to adopt supervised learning approaches for detecting topics. Moreover, it is even more expensive to annotate the dynamic change of topic information within one sentence, as exemplified by the topic change from "person" to "company" in Figure 1. Therefore, we investigate the task of automatically detecting the hidden topic information and incorporating such information into the generation of sentences. Many works have been proposed to utilize static topic information to improve generation performance. Chen et al. (2016) and Ou et al. (2018) propose to represent the topic of each sentence as a learnable vector. The topic is predicted from the input sentence and is used to enhance the generation phase. Xing et al. (2017) and Zhang et al. (2016) detect the topic representation by applying a pre-trained LDA model to the input sequence. Moreover, Choudhary et al. (2017) and Ou et al. (2018) predict the topic representation directly from the input sequence using Recurrent Neural Networks (RNNs). All the above methods assume that the topic does not change during generation so as to make the problem tractable, which sacrifices the advantage of modeling the dynamic nature of topic information.
We propose a novel Dynamic Topic Tracker (DTT) neural model to tackle the problem. Different from existing models, our proposed DTT model learns how the target sentence topic dynamically evolves and how to use the topic information to guide the generation process simultaneously. Specifically, our DTT model is a neural model composed of four parts, namely, the state tracker, the topic attention, the global topic bank, and the topic memory. The state tracker captures the decoder state for each generation step. The topic attention uses the captured decoder state to focus on the input hidden representation to get a local topic state. The topic bank learns a global hidden topic representation and it calculates the most suitable local topic representation for each local topic state. The topic memory is used to memorize the previous local topic representation and computes the dynamic topic state for each generation step to guide the target sentence generation procedure.

Related Work
Recently, various data-to-text tasks have been proposed to handle different kinds of data. Gardent et al. (2017a; 2017b) construct the WebNLG dataset, which aims at generating text descriptions based on DBpedia (Auer et al., 2007) triples. Lebret et al. (2016) and Chisholm et al. (2017) propose to generate a person's biography based on Wikipedia's infobox. Fu et al. (2020a) build the WikiEvent dataset, which aims at generating text based on an event chain. Novikova et al. (2017) generate restaurant reviews based on restaurant attributes. Wiseman et al. (2017) generate basketball match descriptions based on game records. Moreover, Fu et al. (2020c) propose to directly train the model on partially-aligned data called WITA, while Fu et al. (2020b) propose to train a model on purely unaligned data in an unsupervised manner with a dual learning framework. All of the above tasks aim at converting formatted data into natural language text that is easier to understand.
Some models have been proposed to solve the KB-to-text problem by utilizing various information in the KBs. Chisholm et al. (2017) propose to directly rank the triples by relation frequency and flatten the triples to pure text. The flattened text is used as the input for a sequence-to-sequence model to generate the output text. Vougiouklis et al. (2018) propose to use a triple encoder to encode each triple into a hidden vector. The decoder input is constructed by simply concatenating all of the hidden vectors. Trisedya et al. (2018) propose a GTR-LSTM model to encode not only the triple information but also the structure information of the entity graph into a hidden semantic space. Jain et al. (2018) exploit a mixed hierarchical attention based encoder-decoder model to leverage the structure and content information. Shimorina and Gardent (2018) propose to use delexicalization and a copy mechanism to enhance the performance of the sequence-to-sequence framework. Konstas and Lapata (2013) and Wiseman et al. (2018) propose template-based methods that generate text using template information extracted from the training set. Cheng et al. (2020) propose to generate text descriptions for entities by utilizing knowledge distilled from an existing knowledge base. However, none of the above works consider topic information in the KB-to-text generation process, and they are thus not directly comparable to the work proposed in this paper.
Some works in text generation (Gatt and Krahmer, 2018) have been proposed to incorporate topic information to help generate the text. These ideas can be adopted in KB-to-text generation. Tars and Fishel (2018) and Johnson et al. (2017) add an extra topic tag to the source sentence to incorporate the topic information into the model. The whole model is built on the sequence-to-sequence (Sutskever et al., 2014; Klein et al., 2017) framework with standard attention (Luong et al., 2015). Mikolov and Zweig (2012) as well as Liu et al. (2015) propose to use the topic information as extra features to enhance the performance of the language model and word embedding. Chen et al. (2016) and Ou et al. (2018) use the same idea and utilize the topic feature to enhance the generation of the text. However, all these methods assume that the topic information is known in advance.
Some methods investigate the problem setting in which the topic information is not given and needs to be detected. For example, the topic information can be detected with Latent Dirichlet Allocation (LDA). Zhang et al. (2016), Dziri et al. (2018) and Wang et al. (2019) detect the topic distribution of words via a topic model to enhance the translation procedure. Xing et al. (2017) propose a TA-Seq2Seq framework which uses the word topic information from LDA to generate responses in chatbot dialog systems. Moreover, some researchers propose to directly detect the topic vector from the input sentences in the sequence-to-sequence framework. For example, Choudhary et al. (2017) propose to train a classifier to predict the topic of the source sentence and use it to help generate the dialog response. Ou et al. (2018) also propose to predict the topic vector directly from the input sequence. Dathathri et al. (2020) propose to use the topic information as a reward function. However, none of the existing works can capture the dynamic topic information suitable for the KB-to-text generation problem.

Our Framework
The KB-to-text generation task aims to generate one or more sentences based on a given set of triples. For example, given a triple group {⟨Bill Gates, BirthPlace, Seattle⟩, ⟨Bill Gates, FounderOf, Microsoft⟩} as input, we aim at generating a sentence such as "Bill Gates, the founder of Microsoft, was born in Seattle." Formally, the input is a set of triples denoted as $\{\langle h_1, r_1, t_1\rangle, \langle h_2, r_2, t_2\rangle, \cdots, \langle h_{\tilde{n}}, r_{\tilde{n}}, t_{\tilde{n}}\rangle\}$, in which $h_i$, $r_i$ and $t_i$ stand for the $i$th head entity, relation and tail entity respectively, and $\tilde{n}$ is the number of triples. The goal is to maximize the conditional probability of the generated text $(s_1, s_2, \cdots, s_m)$ given such input in the training set. Denoting $k_i = \langle h_i, r_i, t_i\rangle$, the problem can be expressed as $\max_\theta p_\theta(s_1, s_2, \cdots, s_m \mid \{k_1, k_2, \cdots, k_{\tilde{n}}\})$, in which $\theta$ denotes all the parameters in the model and $m$ is the length of the generated text.

Figure 2: Overview of our framework. Due to the limited space, we omit the traditional attention layer. This figure shows the first time step of the decoding process.
Our proposed framework is shown in Figure 2. It is built on top of a sequence-to-sequence neural structure. The encoder is similar to the standard sequence-to-sequence model while the decoder is equipped with the novel Dynamic Topic Tracker (DTT) model.
Following the idea of Chisholm et al. (2017), we construct the input by listing all words in the triple elements one after another (i.e. $[h_1, r_1, t_1, h_2, r_2, t_2, \cdots, h_{\tilde{n}}, r_{\tilde{n}}, t_{\tilde{n}}]$) to form a sequence denoted as $X = [x_1, x_2, \cdots, x_n]$, $X \in \mathbb{R}^{d \times n}$, in which $d$ is the embedding size, $n$ is the total number of input words and $x_i$ is the embedding of the $i$th word. Afterwards, the sequence is encoded by an encoder (left bottom part of Figure 2) into a hidden context vector $h_c$. We use a stacked LSTM layer as the encoder: $H = \mathrm{LSTM}(X)$, where $H = [h_1, h_2, \cdots, h_n]$, $H \in \mathbb{R}^{h \times n}$ is the output hidden matrix of the input sequence, in which each column is the hidden vector representation of an input word, and $h$ is the size of the output hidden vectors. $h_c = h_n \in \mathbb{R}^{h}$ is the last column of $H$, which can be regarded as a hidden representation of the full input sequence.
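The input construction described above can be sketched in a few lines of Python. This is only an illustration: the function name `flatten_triples` and the whitespace-based word splitting are assumptions and are not taken from the paper, and the embedding lookup that turns each word into $x_i$ is left out.

```python
def flatten_triples(triples):
    """Linearize (head, relation, tail) triples into one word sequence."""
    tokens = []
    for head, relation, tail in triples:
        for element in (head, relation, tail):
            # Assumption: multi-word elements are split on whitespace.
            tokens.extend(element.split())
    return tokens


triples = [
    ("Bill Gates", "BirthPlace", "Seattle"),
    ("Bill Gates", "FounderOf", "Microsoft"),
]
words = flatten_triples(triples)
print(words)
```

With this splitting, the two example triples yield a sequence of n = 8 words, each of which would then be mapped to a d-dimensional embedding before being passed to the encoder.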
We design a novel decoder (depicted on the right hand side of Figure 2) which exploits the dynamic topic information at each generation step. This decoder generates the output sentence based on the hidden context vector $h_c$ and the output vector of the last time step, $y_{t-1} \in \mathbb{R}^{d}$. The input and the output are therefore similar to those of a standard decoder. However, different from traditional decoders, our decoder contains the DTT model, which is capable of detecting the dynamic topics and incorporating these topic vectors into the generation process. DTT is described in detail in the next sub-section. The new decoding procedure can be expressed as $u_t = \mathrm{DTT}([h_c; y_{t-1}])$ and $y_t = \mathrm{Attn}(\mathrm{LSTM}([h_c; y_{t-1}; u_t]), H)$, in which $[h_c; y_{t-1}] \in \mathbb{R}^{h+d}$ is the concatenation of $h_c$ and $y_{t-1}$. The DTT model detects the topic vector $u_t$ for the current generation step based on this input. $[h_c; y_{t-1}; u_t] \in \mathbb{R}^{h+d+u}$ is the concatenation of the three vectors and is used as the input vector for the following decoder layer. $u$ is the size of the dynamic topic state vector $u_t$. Attn is a commonly used attention layer for the decoder, similar to that of Luong et al. (2015).

Dynamic Topic Tracker (DTT)
Our proposed DTT aims at capturing the dynamic topic information at each generation step. It represents each topic by a vector which is learned during training. When given a new set of triples, the model automatically finds the most suitable topic vector for each generation step and uses it to guide the prediction of the output words. It contains four components, namely the state tracker, the topic attention component, the global topic bank, and the topic memory. The state tracker obtains a hidden representation of the current decoder step's state. The topic attention component utilizes this state to get a local topic state vector. This vector is then fed into the global topic bank to get the local topic representation of the current local topic state. Afterwards, the local topic representation is sent to the topic memory, which computes a dynamic topic state based on the previous states. The details of each component are described in the following sub-sections.

State Tracker
The state tracker is used to capture the status of the current generation step. It is composed of several stacked LSTM layers and outputs a state vector. The input sentence hidden vector $h_c$ and the output of the last time step $y_{t-1}$ are concatenated and fed to a stacked LSTM layer, which can be expressed as $q_t = \mathrm{LSTM}([h_c; y_{t-1}])$, in which $q_t \in \mathbb{R}^{h}$ is the hidden representation of the current state and is used to calculate the topic representation by the following topic attention component. The state tracker is similar to a decoder. However, there is no dropout layer for the output, and the state tracker is trained to capture the state of the current generation step rather than to directly predict the output representation.

Topic Attention
The topic attention uses the current state $q_t$ of the state tracker to calculate the attention over the hidden representations $[h_1, h_2, \cdots, h_n]$ of the input sequence. A relevance score is calculated as $\tilde{q}_t = W_q q_t + b_q$ and $c_t = H^\top \tilde{q}_t$, in which $\tilde{q}_t \in \mathbb{R}^{h}$ is the transformed state vector, and $W_q \in \mathbb{R}^{h \times h}$ and $b_q \in \mathbb{R}^{h}$ are the transformation matrix and the corresponding bias vector. $c_t \in \mathbb{R}^{n}$ is calculated by a simple vector inner product between $\tilde{q}_t$ and each $h_i$, so each element of $c_t$ is the similarity score for one $h_i$. Afterwards, the score vector $c_t$ is sent to a softmax layer to calculate the normalized attention over the $h_i$: $a_t[i] = \exp(c_t[i]) / \sum_{j=1}^{n} \exp(c_t[j])$, in which $a_t$ is the normalized attention and $a_t[i]$ is the $i$th element of $a_t$. Finally, the topic state is calculated as the sum of the $h_i$ weighted by the attention vector $a_t$: $r_t = \sum_{i=1}^{n} a_t[i] h_i$, where $r_t$ is the local topic state vector indicating the topic state of the current decoder step. The topic attention has a structure similar to that of traditional attention. The difference is that the output $r_t$ is used to find a new topic vector in the topic bank, which makes the topic representation more general across different kinds of sentences.
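The topic attention step can be sketched in plain Python as follows. This is a minimal sketch of the computation described in the text (affine transform, inner-product scores, softmax, weighted sum); the function names, the list-of-lists matrix layout and the toy values are assumptions made for illustration and do not come from the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topic_attention(q_t, hidden_states, W_q, b_q):
    """Return (a_t, r_t) for one decoding step.

    q_t: state-tracker vector of length h.
    hidden_states: list of n encoder vectors h_i, each of length h.
    W_q, b_q: h x h transformation matrix (list of rows) and bias vector.
    """
    # Transformed state: q~_t = W_q q_t + b_q
    q_tilde = [
        sum(w * q for w, q in zip(row, q_t)) + b
        for row, b in zip(W_q, b_q)
    ]
    # Relevance scores: c_t[i] = <q~_t, h_i>
    c_t = [sum(a * b for a, b in zip(q_tilde, h_i)) for h_i in hidden_states]
    # Normalized attention over the input positions
    a_t = softmax(c_t)
    # Local topic state: r_t = sum_i a_t[i] * h_i
    size = len(hidden_states[0])
    r_t = [
        sum(a * h_i[k] for a, h_i in zip(a_t, hidden_states))
        for k in range(size)
    ]
    return a_t, r_t


# Toy values (hypothetical): h = 2, n = 3, identity transformation.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_q = [[1.0, 0.0], [0.0, 1.0]]
b_q = [0.0, 0.0]
a_t, r_t = topic_attention([2.0, 0.0], H, W_q, b_q)
```

In the toy example the first and third input positions receive the same score and therefore the same attention weight, while the second position, which is orthogonal to the query, receives less.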

Global Topic Bank
The global topic bank acts as a database that stores the trained hidden topic representations, which are shared by all sentences. Note that, different from the traditional topic representation, which is a word distribution (Blei et al., 2003), our topics are represented by dense vectors. The bank consists of two matrices of the same size, namely $P, Q \in \mathbb{R}^{u \times l}$, in which $u$ is the size of the topic vector and $l$ is the number of topic vectors. $l$ is a hyperparameter that can be assigned by the user. Once $r_t$ is calculated, it is used to calculate a similarity score for each topic by a simple vector inner product: $\tilde{w}_t = P^\top r_t$, in which $\tilde{w}_t \in \mathbb{R}^{l}$ is a score vector indicating the similarity of each topic to $r_t$. It is then normalized with a softmax function to get the probability of each topic: $w_t = \mathrm{softmax}(\tilde{w}_t)$.
The matrix $P$ can be regarded as a projection matrix which spans the topic state semantic space. After the projection onto $P$, the information in $r_t$ that is irrelevant to the topic is eliminated. The topic is thus represented by the combination coefficients of the topics instead of a single vector. Afterwards, the topic representation is obtained as the sum of the topic representations weighted by the probability vector $w_t$: $\tilde{u}_t = Q w_t$, in which $\tilde{u}_t$ is the local topic representation. It should be noted that we use another matrix $Q$ to get the final topic representation. The reason is that $P$ and $Q$ represent the topics in different semantic spaces. Specifically, $P$ represents topics in the topic state space for the state tracker, while $Q$ represents topics in the topic representation space for the generation process.
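The lookup in the global topic bank can be illustrated with a small pure-Python sketch. It assumes that the local topic state $r_t$ has the same length as the columns of $P$ (the text describes a plain inner product, which requires matching sizes), and it stores $P$ and $Q$ as lists of topic vectors, i.e. one entry per column. The function names and toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topic_bank_lookup(r_t, P, Q):
    """Return (w_t, u_tilde) for a local topic state r_t.

    P, Q: lists of l topic vectors (the columns of the u x l matrices).
    Assumption: len(r_t) equals the length of each vector in P.
    """
    # Similarity score of each topic: w~_t[j] = <p_j, r_t>
    w_tilde = [sum(p * r for p, r in zip(p_j, r_t)) for p_j in P]
    # Probability of each topic
    w_t = softmax(w_tilde)
    # Local topic representation: weighted sum of the columns of Q
    size = len(Q[0])
    u_tilde = [
        sum(w * q_j[k] for w, q_j in zip(w_t, Q))
        for k in range(size)
    ]
    return w_t, u_tilde


# Toy bank (hypothetical): l = 3 topics, vector size 2.
P = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
Q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w_t, u_tilde = topic_bank_lookup([3.0, 0.0], P, Q)
```

Because the toy state points in the direction of the first column of $P$, the first topic receives most of the probability mass, and the returned representation lies close to the first column of $Q$. Keeping $P$ and $Q$ separate mirrors the text: scoring happens in one space and the representation is read out from another.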

Topic Memory
Since $\tilde{u}_t$ is calculated only from the current decoder state, it may lack some historical information. For words that carry no obvious topic information, such as "the" or "have", it is necessary to refer to the topic of the last step. Therefore, we design a topic memory component that keeps the state of the previous topics and helps build the current topic information. We use a simple RNN to memorize the history information. The topic memory can be expressed as $u_t, \tilde{h}_t = \mathrm{RNN}(\tilde{u}_t, \tilde{h}_{t-1})$, in which $u_t$ is the dynamic topic state vector and $\tilde{h}_t$ is the hidden state of the RNN. The history topic information is stored in $\tilde{h}_t$ and is passed on to help the generation at the next step.
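How a simple recurrent memory carries topic information across steps can be sketched as below. The paper only states that a simple RNN is used, so the tanh (Elman-style) cell, the choice to output the hidden state directly as $u_t$, and all weights and names here are assumptions for illustration.

```python
import math


def topic_memory_step(u_tilde, h_prev, W_in, W_rec, bias):
    """One step of a tanh RNN cell (assumed form of the topic memory).

    u_tilde: local topic representation for the current step.
    h_prev: hidden state from the previous step.
    W_in, W_rec: weight matrices given as lists of rows; bias: list.
    Returns (u_t, h_t); here the output u_t is a copy of the hidden state.
    """
    h_t = []
    for row_in, row_rec, b in zip(W_in, W_rec, bias):
        pre_activation = (
            sum(w * x for w, x in zip(row_in, u_tilde))
            + sum(w * x for w, x in zip(row_rec, h_prev))
            + b
        )
        h_t.append(math.tanh(pre_activation))
    return list(h_t), h_t


# Hypothetical weights: identity input map, damped recurrence.
W_in = [[1.0, 0.0], [0.0, 1.0]]
W_rec = [[0.5, 0.0], [0.0, 0.5]]
bias = [0.0, 0.0]

# Step 1: a word with a clear topic signal.
u1, h1 = topic_memory_step([1.0, 0.0], [0.0, 0.0], W_in, W_rec, bias)
# Step 2: a function word with no topic signal of its own.
u2, h2 = topic_memory_step([0.0, 0.0], h1, W_in, W_rec, bias)
```

In the second step the input carries no topic signal, yet the first component of `u2` stays positive because it is inherited from the previous hidden state, which is the behavior the text motivates for words such as "the".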

Dataset
We evaluate our framework on release v2 of the WebNLG KB-to-text generation dataset (Gardent et al., 2017a).

Comparison Models
We compare our DTT with several topic-based models. Our main focus is on models that can detect topic information (e.g., LDA-S2S and T2S) and models that consider dynamic topics, such as DLDA-S2S. We also compare with models that use additional given topic information (e.g., TopicTag and TopicFeature). We implement all the baseline models ourselves to make them comparable with each other. The comparison models are as follows:
S2S: follows the model proposed by Shimorina and Gardent (2018), which uses a standard sequence-to-sequence model with attention.
TopicTag: follows the model proposed by Johnson et al. (2017) and Tars and Fishel (2018). It utilizes additional information, namely the domain tag provided by the dataset for each sentence, as the topic information. Precisely, the domain tag of each sentence is added as a new word at the end of the sentence, similar to Johnson et al. (2017) and Tars and Fishel (2018). The modified sentences are then fed into the S2S model.
TopicFeature: follows the model proposed by Chen et al. (2016) and Ou et al. (2018). This model learns a vector representation for each domain tag rather than just adding it as an additional word. Each domain is denoted by a one-hot vector, which is fed into a feedforward layer to get the topic membership vector. This vector is then used as an extra feature to predict the decoder output.
LDA-S2S: follows the model proposed by Zhang et al. (2016) and Xing et al. (2017). We first train an LDA model on the source sentences of the training dataset. Then, the sentence topic is calculated by averaging the topic distributions of the words, and it is used as an extra context feature in the decoder, similar to the TopicFeature model.
DLDA-S2S: follows the model proposed by Mikolov and Zweig (2012) and Dziri et al. (2018). We dynamically calculate the topic distribution for each word by summing the source word vectors weighted by the attention. The word topic distribution is calculated similarly to LDA-S2S.
T2S: follows the model proposed by Choudhary et al. (2017) and Ou et al. (2018). The topic distribution is predicted based on the input sentence hidden vector $h_c$. $h_c$ is fed into several linear layers to get a fixed topic representation, which is then used as a new context feature at each step of the decoder.
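The sentence-level topic used by the LDA-S2S baseline, described above as the average of the per-word topic distributions, can be sketched as follows. The per-word distributions would come from a trained LDA model; here they are a hand-written dictionary with hypothetical values, and the uniform fallback for sentences without any known word is an assumption added to keep the function total.

```python
def sentence_topic(words, word_topics, num_topics):
    """Average the per-word topic distributions of a sentence.

    word_topics: dict mapping a word to its topic distribution
    (a list of num_topics probabilities). Unknown words are skipped.
    """
    known = [word_topics[w] for w in words if w in word_topics]
    if not known:
        # Assumption: fall back to a uniform distribution.
        return [1.0 / num_topics] * num_topics
    return [
        sum(dist[k] for dist in known) / len(known)
        for k in range(num_topics)
    ]


# Hypothetical per-word topic distributions over 2 topics.
word_topics = {
    "Bill": [0.8, 0.2],
    "Microsoft": [0.2, 0.8],
}
topic = sentence_topic(["Bill", "founded", "Microsoft"], word_topics, 2)
```

Note that the resulting vector is fixed for the whole sentence: a "person" word and a "company" word average out to a mixed distribution, which is the static-topic limitation that the dynamic variants try to address.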

Experiment Setup
We implement our model based on OpenNMT-py, a Python port of OpenNMT (Klein et al., 2017). All the hyper-parameters are tuned on the dev set with grid search. We follow the baseline model's default settings (Gardent et al., 2017a; Gardent et al., 2017b), in which the word embedding size is 500. The size of the LSTM hidden state vectors is set to 500. We use two layers of LSTMs for both the encoder and the decoder. For the last layer of the stacked LSTM, we add a dropout layer with a ratio of 0.3. We use SGD as our optimizer with an initial learning rate of 1.0 and a decay rate of 0.5. The commonly used attention (Luong et al., 2015) is added to all models. When generating a new sentence, we use beam search with beam size 5, which is a traditional setting for generation tasks. We tune the number of topics of our model and of the comparison models on the dev set. In the TopicFeature model, the number of topics is set to 20. In LDA-S2S, we train the LDA model with a Python-based LDA package and the number of topics is set to 100. In the T2S model, we set the number of topics to 200. In the DTT model, we set the number of topics to 500. We evaluate all the models with the same evaluation script. Several metrics are evaluated, including BLEU (Papineni et al., 2002), ROUGE-L (Lin, 2004), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005) and CIDEr (Vedantam et al., 2015). Since some metrics are sensitive to randomness, we run each model 5 times and report the median score with the standard deviation.
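The reporting protocol (median and standard deviation over repeated runs) can be reproduced with Python's standard library. The scores below are placeholder numbers, not results from the paper, and the use of the sample standard deviation (`statistics.stdev`) rather than the population version is an assumption, since the text does not say which one is reported.

```python
import statistics


def summarize_runs(scores):
    """Return (median, sample standard deviation) of per-run scores."""
    return statistics.median(scores), statistics.stdev(scores)


# Placeholder scores for five hypothetical runs of one model.
run_scores = [10.0, 12.0, 11.0, 13.0, 14.0]
median_score, std_score = summarize_runs(run_scores)
print(f"{median_score:.2f} +/- {std_score:.2f}")
```

Reporting the median rather than the mean makes the headline number less sensitive to a single unusually good or bad run, while the standard deviation still exposes how much the runs vary.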

Results
The experimental results are shown in Table 2. We can observe that our DTT model outperforms all comparison models significantly and consistently. It illustrates that the DTT model can capture the dynamic topic information to mitigate the off-topic problem and thus improves the overall generation performance. Besides, our DTT model not only improves performance but also improves the stability of the performance. It can be observed that the standard deviation is almost reduced to half of that in the S2S model and is the smallest in most of the metrics. These results show that our proposed DTT model is more robust against the randomness when generating sentences by incorporating the dynamic topic information.
The T2S model's results show that simply using several linear layers to learn topic information performs no worse than models with annotated tags or pre-trained topic allocation. This illustrates that pure neural models can learn a reasonable topic representation. However, the T2S model does not perform as well as our DTT model. The main reason is that it predicts the topic directly from the input sequence and the topic is fixed during the whole generation process. The result of DLDA-S2S with dynamic topic vectors is also better than that of its counterpart with static topics, i.e. LDA-S2S. All these results show that capturing the dynamic topic can provide more suitable information for text generation.
The TopicTag model and the TopicFeature model outperform the S2S model. However, they fail to exceed the other models, since the domain tags only give very general and insufficient topic information for each sentence. Therefore, even learning topic information from scratch outperforms these models that use domain tags.
We conduct an ablation experiment by investigating our model without the previous topic information (denoted as DTT w/o memory). The performance decreases slightly, which indicates that the topic memory can utilize the historical topic information to learn a better representation of the current topic. Without the topic memory, the decoder loses track of the current topic for words without explicit topic information. Besides, the standard deviation increases slightly in all metrics, showing that the topic memory also makes the model more robust.
Table 3 (fragments):
The owner of the government of Aarhus is The location of government.
The 3Arena is located in North Wall, California and it is owned by Live Nation Entertainment.
Castle novel, language, English language
The novel Owen Glendower is a notable team.
The official language in Poland is the English language.

Topic Evolving Analysis
To show that our DTT model does capture the evolution of the topics, we sample two sets of similar triples and generate the corresponding text to illustrate how the topics evolve. We set the number of topics to 10 and retrain the model for the sake of easier illustration. We record the topic distribution at each generation step and observe how it changes. The result is shown in Figure 3. At each step, our DTT model predicts one of the topics with a very high probability, while the other topics have relatively low probabilities. This observation agrees with our intuition about topics. When we talk about some facts within one sentence, the sentence may contain several topics, but at any one time only one topic dominates. The dominating topic changes during the generation procedure, which illustrates that our model captures the changes of the hidden topics when moving from one triple to another during generation. There are some major topics in each group of triples. For example, the sentences generated in the figure mainly talk about a person, and the main topic for both of them is Topic 9. It can also be observed that some topics are allocated to the same place in the two sentences. For example, Topic 6 captures the education background while Topic 9 captures the discovery event.
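The analysis described above, recording the topic distribution at every generation step and reading off the dominating topic, can be sketched as follows. The distributions are hypothetical toy values, not the ones plotted in Figure 3, and the helper names are introduced only for this illustration.

```python
def dominant_topics(step_distributions):
    """Index of the highest-probability topic at each generation step."""
    return [
        max(range(len(dist)), key=dist.__getitem__)
        for dist in step_distributions
    ]


def topic_switches(dominant):
    """Steps at which the dominating topic differs from the previous step."""
    return [
        t for t in range(1, len(dominant))
        if dominant[t] != dominant[t - 1]
    ]


# Hypothetical topic distributions for three generation steps, 3 topics.
steps = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.85, 0.05],
]
dominant = dominant_topics(steps)
switches = topic_switches(dominant)
```

In the toy data the first topic dominates the first two steps and the second topic takes over at the third step, so the only switch is reported at step index 2.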

Case Study
In order to give an intuitive illustration of the off-topic problem and show how it is alleviated by our model, we sample some challenging cases handled by the S2S model together with the results produced by our DTT model. Neither model performs well on these cases. Nevertheless, the sentences generated by the DTT model are noticeably better. The results are shown in Table 3. All these triples are very challenging to handle, so neither model generates the sentences perfectly. However, since the DTT model is guided by the topic information, the topics of all the generated sentences are reasonable, though some generated terms are not quite correct. The S2S model is prone to being misled by some keywords and generates unrelated sentences. For example, for the triple ⟨Sumatra, ethnicGroup, Malays (ethnic group)⟩, the S2S model is misled by the beginning tag "Sumatra" and generates a sentence that is totally unrelated to the triple. This may be caused by the fact that the training set contains many examples related to "Sumatra" and "Asam pedas". In contrast, our DTT model is guided by the detected topic information at each decoding step, making it more likely to predict the correct sentence. Such topic information also makes the model more robust, so the decoder is less likely to generate sentences randomly. This observation to some extent explains why the standard deviation of the DTT model's performance is only half of that of the S2S model.

Conclusions and Future Work
In this paper, we identify the off-topic problem in KB-to-text generation. We utilize dynamic topic information to alleviate this problem and improve the generation performance. To achieve this, we propose a DTT model which can learn the hidden representation of the topic information.
During the sentence generation process, it utilizes the learned topic information in the sequence-to-sequence framework to enhance generation. More importantly, the topic information is dynamic at each step of the generation process, which gives the model stronger capability than existing works. Experimental results on a benchmark dataset show that our model can effectively capture the dynamic topic information at each step in the decoder. Despite the promising results our model has achieved, there are some remaining challenges: (1) The model stacks many LSTM layers, which makes the gradient hard to back-propagate if more layers are added. Newer architectures such as the Transformer or gated CNNs can be used to tackle this problem. (2) The topic representation only contains information from our training set. Nevertheless, the novel architecture makes it possible to use any set of topic vectors, which can be pre-trained on a larger unannotated dataset.