A Topic Augmented Text Generation Model: Joint Learning of Semantics and Structural Features

Text generation is among the most fundamental tasks in natural language processing. In this paper, we propose a text generation model that learns semantic and structural features simultaneously. The model captures structural features with a sequential variational autoencoder component and leverages a topic modeling component based on Gaussian distributions to enhance the recognition of text semantics. To make the reconstructed text more coherent with the topics, the model further adapts the encoder of the topic modeling component as a discriminator. Experimental results on several datasets demonstrate that our model outperforms several state-of-the-art models in terms of text perplexity and topic coherence. Moreover, the latent representations learned by our model are superior to others in a text classification task. Finally, given input texts, our model can generate meaningful texts that hold similar structures but belong to different topics.


Introduction
Text generation is a fundamental task in natural language processing (NLP). Existing methods for text generation are mostly limited to the supervised setting and designed for specific applications (e.g., machine translation (Bahdanau et al., 2015), text summarization (Rush et al., 2015)). Several studies (Yu et al., 2017; Zhang et al., 2016) attempt generic text generation based on deep generative models (e.g., GAN, VAE). However, models based on the GAN (Generative Adversarial Network) are not able to generate explicit latent codes with salient features of texts. Different from GAN-based models, the VAE (Variational Autoencoder) and its variants can obtain latent codes of texts while reconstructing the texts with the decoder, where it is assumed that the generation process is controlled by codes in a continuous latent space. A natural way to implement VAEs is to adopt autoregressive networks (e.g., RNNs) as the encoder and decoder. This kind of implementation considers the sequential information of texts and is able to model their linguistic structure. However, it is not good at modeling semantics (Dieng et al., 2017) and may suffer from the latent variable collapse problem (Bowman et al., 2016), in which the decoder ignores the information in the inferred latent codes.
In general, texts inherently contain semantic features and structural features. High-quality texts can be generated from the latent codes only when the codes contain both structural and semantic information. The standard VAE assumes that the latent code follows a single Gaussian distribution, so we cannot distinguish which part of the code controls the structure and which part controls the semantics. In other words, it is difficult to generate effective texts by controlling text features directly.
To generate high-quality texts (i.e., with semantic and structural information), we propose a model named TATGM, which adopts a sequential VAE to learn structural features of texts and builds a topic modeling component that extracts semantic features of texts. The main contributions of this paper are summarized as follows:
• TATGM can capture the semantics of texts by introducing the topic modeling component. The topic modeling component generates the words of texts based on Gaussian distributions, which enables us to take full advantage of the information shared by the word embeddings. Moreover, the encoder of the topic modeling component serves as a discriminator that forces the decoder of the sequence modeling component to generate texts whose semantics are as close to the original texts as possible. Notably, this discriminator needs no extra training supervised by labeled data.
• The latent code learned by TATGM is a concatenation of the latent variables of the topic modeling component and the sequence modeling component. With the two separate parts of the code, TATGM can control structural and semantic information independently. Thus, we can obtain texts with changed semantics or structure by changing one part of the code. One interesting application is that we can obtain question-answering pairs with the same structure in different semantic spaces.
• The modeling ability of TATGM is evaluated by extensive experiments in terms of perplexity and topic coherence. Furthermore, to verify whether TATGM can learn salient features, a classification task using the latent codes is conducted. Experimental results show that TATGM achieves the best performance when compared with many existing models. Finally, several texts generated by TATGM are presented, which indicate that TATGM can generate different expressions of texts with the same structure in different topics.
The rest of the paper is organized as follows. Section 2 introduces the related work. Section 3 describes the model TATGM in detail. Section 4 gives an experimental evaluation. Finally, the paper is concluded in Section 5.

Related Work
VAEs (Kingma and Welling, 2014) are a type of deep generative model that learns explicit latent codes of input data in a continuous latent space and generates data with a decoder. VAEs can extract the features of input texts by posterior inference with neural networks and the reparameterization trick. VAE-based models have been widely applied in image generation (Gregor et al., 2015), machine translation (Zhang et al., 2016), and knowledge graph reasoning (Zhang et al., 2018). VAE-based models can also be applied to different text processing tasks, where the input document is treated as a bag of words (Miao et al., 2016; Srivastava and Sutton, 2017; Miao et al., 2017) or as a sequential text (Bowman et al., 2016). The former type of model is always related to topics. For example, Miao et al. (2016) build the connection by letting the weights of the softmax decoder indicate topic interpretability. Although these models overlook sequential information in texts, they can learn effective latent codes with global semantics. On the other hand, VAE-based models that take sequences as input usually adopt RNNs as encoders or decoders. However, latent variable collapse (i.e., the posterior of the latent variable collapses to the prior) makes the sequential VAE generate texts without using the latent code, so that the latent code does not carry enough information about text features.
To avoid the latent variable collapse problem, two types of methods are adopted. The first type tackles this problem by weakening the context modeling ability of decoders, mostly at the model architecture level. One method adopts KL annealing, which can be seen as gradually increasing an annealing weight in the objective function. Another introduces an additional mutual information term to compensate for the objective function. Xiao et al. (2018) and Zhao et al. (2017) introduce the bag-of-words loss as an auxiliary loss, which measures how well predictions of words can be made from the latent codes. The idea behind all these methods is the same, i.e., to force the models to generate texts from the latent codes so that the latent codes carry more information about text features. Improving the sequential model by incorporating topics has also been explored (Lau et al., 2017; Dieng et al., 2017; Mikolov and Zweig, 2012). But these models do not have a code for the text structure. Taking TGVAE as an example, it guides the generation of the VAE latent code by the topic distribution. However, it does not separate the semantic and structural latent codes explicitly. In addition, the models mentioned above use a multinomial LDA, which cuts off the possibility of leveraging the semantics in embeddings. As a remedy, Das et al. (2015) and Hu et al. (2012) propose to adopt a Gaussian-based topic model which assumes each word is generated by a Gaussian distribution. However, their learning algorithms are based on sampling and variational inference, which cannot be assembled in an end-to-end mode.
Another line of work proposes a text generation model based on VAE, which aims at generating sentences with controllable styles by learning disentangled latent representations. It feeds generated texts into a discriminator to ensure that these texts have the given style and optimizes the generator with the signals backpropagated from the discriminator. Tian et al. (2018) use a classifier trained on small datasets as the discriminator. A further approach adopts a language model as the discriminator to control the style transfer of generated texts.
Figure 1: The holistic structure of TATGM. The dotted arrow denotes that the texts generated by the sequence modeling decoder are fed into the topic modeling encoder as the input of the discriminator.
Different from existing work, our model combines a sequence modeling component and a topic modeling component so that the final latent code of our model contains both the topic semantics and the sequential structure of texts. In particular, our model can learn these two features simultaneously, since Auto-Encoding Variational Bayes (AEVB) is employed to train both components. Moreover, our model generates bag-of-words text representations from the latent codes, which avoids the latent variable collapse problem.

Model
TATGM is essentially a hybrid autoencoder which comprises a topic modeling component capturing topic information and a sequence modeling component capturing structural features. The topic modeling component captures topic information through a Gaussian-based topic model, while the sequence modeling component integrates the topic information and generates text. As shown in Fig. 1, the input texts are fed into the two components simultaneously. Each component contains an encoder and a decoder, and each component is marked by its own color.

Neural Topic Modeling Component
Similar to existing topic modeling methods, we treat bag-of-words representations of texts as input. However, since the encoder in the topic modeling component is expected to act as a discriminator and guarantee that texts generated by the sequence modeling decoder carry specific topic information, the encoder cannot take discrete representations (e.g., one-hot representations) of texts as input, because such discrete representations are not compatible with the gradients propagated back from the discriminator. Therefore, we employ word embeddings as input, considering that the word embedding space is continuous and the similarity between any two words can be calculated by the Euclidean distance. Further, we assume that there are $K$ topics in the corpus and that each topic is represented as a multivariate Gaussian distribution with a mean and a variance (e.g., $\mathcal{N}(\mu_k, \sigma_k^2 I)$ for a given topic $k$). In our neural topic modeling component, a document with $n$ words is represented as $x = \{x_i\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^e$ denotes the embedding of the $i$-th word of the document. The generative process of the document is as follows.
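1. For each document, draw the document-topic distribution $z_t \sim \mathrm{Dir}(\alpha)$.
2. For each word position $i$ in the document: (a) draw a topic assignment $t_i \sim \mathrm{Multinomial}(z_t)$; (b) draw the word embedding $x_i \sim \mathcal{N}(\mu_{t_i}, \sigma_{t_i}^2 I)$.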
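According to the generative process above and the marginalization of $t_i$, the likelihood $p(x)$ of the document $x$ can be derived as

$$p(x \mid \alpha, \mu, \sigma) = \int p(z_t \mid \alpha) \prod_{i=1}^{n} \sum_{k=1}^{K} z_{t,k}\, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2 I)\, \mathrm{d}z_t,$$

where $z_{t,k}$ denotes the $k$-th entry of $z_t$. Further, we adopt AEVB and the reparameterization trick to achieve posterior inference and parameter learning. Specifically, similar to Srivastava and Sutton (2017), we first draw a Gaussian random vector by the reparameterization trick and then pass it through a softmax function to parameterize the multinomial document-topic distribution. Therefore, we can replace $\alpha$ with the parameters $\mu_0, \sigma_0^2$ of a Gaussian prior. Thus, we can get the ELBO of our topic modeling component:

$$\mathcal{L}_t = \mathbb{E}_{q(z_t \mid x)}\left[\log p(x \mid z_t)\right] - \mathrm{KL}\left(q(z_t \mid x) \,\|\, p(z_t \mid \mu_0, \sigma_0^2 I)\right). \quad (2)$$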
In Eq. 2, $z_t$ is inferred by the neural network in the encoder. The inference process is detailed as follows. The document $x$ is fed into a two-layer perceptron (MLP) with ReLU as the activation function. The outputs of the MLP are then max-pooled to obtain a fixed-length representation $c$ of the document. The posterior parameters $\mu_t$ and $\sigma_t^2$ are obtained from two feedforward neural networks. Next, a Gaussian variable $p$ is sampled by the reparameterization trick. Finally, $z_t$ is obtained by passing $p$ to the softmax function. Our choices are specified as follows.
where $f_1$ and $f_2$ denote two feedforward neural networks, $W_t$ is a trainable parameter, and $f_{sm}$ denotes the softmax function.
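The inference steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the two-layer MLP is assumed to have been applied already (its per-word outputs are passed in as `word_features`), $f_1$ and $f_2$ are reduced to bias-free linear maps, the second network is assumed to output a log-variance, and the trainable parameter $W_t$ is omitted because its exact placement is not restated here. All function and variable names are hypothetical.

```python
import math
import random


def max_pool(rows):
    """Element-wise max over per-word feature vectors, giving a fixed-length vector c."""
    return [max(col) for col in zip(*rows)]


def linear(weights, vec):
    """Bias-free linear map; a stand-in for the feedforward networks f_1 and f_2."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def softmax(vec):
    """Numerically stable softmax."""
    m = max(vec)
    exps = [math.exp(v - m) for v in vec]
    total = sum(exps)
    return [e / total for e in exps]


def infer_topic_proportions(word_features, w_mu, w_logvar, rng):
    """Map per-word MLP outputs to a document-topic distribution z_t."""
    c = max_pool(word_features)
    mu = linear(w_mu, c)
    logvar = linear(w_logvar, c)  # assumption: the network outputs log(sigma^2)
    # Reparameterization trick: p = mu + sigma * eps with eps ~ N(0, I).
    p = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]
    return softmax(p)


rng = random.Random(0)
features = [[0.2, 1.0, 0.0], [0.9, 0.1, 0.4]]  # 2 words, 3 hidden units
w_mu = [[0.5, -0.2, 0.1], [0.0, 0.3, 0.3]]     # K = 2 topics
w_logvar = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # log-variance 0, i.e. unit variance
z_t = infer_topic_proportions(features, w_mu, w_logvar, rng)
```

Because the softmax is applied to a Gaussian sample, `z_t` always lies on the probability simplex while the sampling step stays differentiable with respect to `mu` and `logvar` in an automatic-differentiation framework.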
Besides, $\mathrm{KL}(q(z_t \mid x) \,\|\, p(z_t \mid \mu_0, \sigma_0^2 I))$ in Eq. 2 is obtained by calculating the KL divergence between $\mathcal{N}(\mu_t, \sigma_t^2 I)$ and $\mathcal{N}(\mu_0, \sigma_0^2 I)$. In general, during the training of topic models, a smaller vocabulary is built after eliminating some specific words (e.g., stop words, frequent words, and rare words) in the corpus. This is a denoising preprocessing step that makes the model more reliable. However, to make the discriminator applicable, our model takes all of the words in the document as input while outputting the words in a smaller vocabulary, as a topic model does. Besides the reason above, this setting can also be regarded as a regularization of the model that strengthens its ability to model documents.
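The KL divergence between two diagonal Gaussians mentioned above has a standard closed form, $\frac{1}{2}\sum_d \left[\sigma_{q,d}^2/\sigma_{p,d}^2 + (\mu_{q,d}-\mu_{p,d})^2/\sigma_{p,d}^2 - 1 + \ln(\sigma_{p,d}^2/\sigma_{q,d}^2)\right]$. The following sketch evaluates it; the function name and the list-based interface are illustrative assumptions, and variances are passed per dimension.

```python
import math


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over dimensions."""
    return 0.5 * sum(
        vq / vp + (mq - mp) ** 2 / vp - 1.0 + math.log(vp / vq)
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )


# Posterior N(mu_t, sigma_t^2 I) against prior N(mu_0, sigma_0^2 I) for K = 3.
kl = kl_diag_gaussians([0.3, -0.1, 0.0], [0.5, 1.0, 2.0], [0.0] * 3, [1.0] * 3)
```

The divergence is zero only when the posterior matches the prior exactly, which is the situation that latent variable collapse drives the model toward.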
Supposing we have a smaller vocabulary, the document represented by this vocabulary is $y = \{y_i\}_{i=1}^{m}$, where $m < n$ is the number of words and $y_i \in \mathbb{R}^e$ is the word embedding of the word at position $i$.
The topic modeling decoder reconstructs the document from $z_t$. We assume that the distribution of every topic is a Gaussian with an identity covariance matrix. Therefore, $p(x \mid z_t)$ in Eq. 2 is in fact $p(y \mid z_t)$ and can be expanded as

$$p(y \mid z_t) = \prod_{i=1}^{m} \sum_{k=1}^{K} z_{t,k}\, \mathcal{N}(y_i \mid \mu_k, I).$$

To stabilize the training process, we first normalize each $y_i$ and $\mu_k$ when calculating the likelihood.
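To make the decoder likelihood concrete, the sketch below evaluates $\log p(y \mid z_t)$ under the assumption that each word embedding is drawn from a mixture of identity-covariance Gaussians weighted by $z_t$, which is the form implied by the generative process. The L2 normalization of $y_i$ and $\mu_k$ is one plausible reading of the normalization step; the function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(a * a for a in vec))
    return [a / norm for a in vec] if norm > 0 else list(vec)


def log_gauss_identity(y, mu):
    """log N(y | mu, I) for an e-dimensional embedding."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(y, mu))
    return -0.5 * (len(y) * math.log(2.0 * math.pi) + sq_dist)


def doc_log_likelihood(words, z_t, topic_means):
    """Sum over words of log sum_k z_t[k] * N(y_i | mu_k, I), via log-sum-exp."""
    means = [l2_normalize(m) for m in topic_means]
    total = 0.0
    for y in words:
        y = l2_normalize(y)
        terms = [
            math.log(z) + log_gauss_identity(y, m)
            for z, m in zip(z_t, means)
            if z > 0.0
        ]
        top = max(terms)
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total


means = [[1.0, 0.0], [0.0, 1.0]]  # K = 2 topics in a 2-dimensional embedding space
doc = [[2.0, 0.0], [3.0, 0.1]]    # both words lie near topic 0
on_topic = doc_log_likelihood(doc, [0.9, 0.1], means)
off_topic = doc_log_likelihood(doc, [0.1, 0.9], means)
```

A topic distribution concentrated on the topic closest to the words yields the higher likelihood, which is the signal that the reconstruction term of the ELBO rewards.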

Sequence Modeling Component
Since the structural features are closely related to the word sequence information, we construct a sequential VAE. In the sequential VAE, the encoder infers the structural latent variable and the decoder reconstructs texts by integrating the topic latent variable and the structural latent variable. This process is depicted in the bottom half of Fig. 1. Since the semantic information is given by the topic modeling component, the reconstruction loss makes the encoder of the sequential VAE focus on encoding the structural features. Let $z_s$ denote the structural latent variable; the ELBO of this component takes the following form:

$$\mathcal{L}_s = \mathbb{E}_{q(z_s \mid x)}\left[\log p(x \mid z_s, z_t)\right] - \mathrm{KL}\left(q(z_s \mid x) \,\|\, p(z_s)\right).$$

We adopt a bi-directional GRU as the encoder. The last hidden state of the encoder is used to infer the parameters $\mu_s, \sigma_s$ of the Gaussian distribution. Then, we get $z_s$ by the reparameterization trick.
where $f_3$ and $f_4$ are feedforward neural networks. Further, we obtain the holistic latent code $z$ by concatenating $z_t$ and $z_s$, i.e., $z = [z_t; z_s]$.
We adopt a GRU as the decoder to reconstruct the document $x = \{x_i\}_{i=1}^{n}$, with the latent code $z$ as its initial state. So, the likelihood of the reconstructed document can be derived by Eq. 5.
where $h_i$ denotes the $i$-th hidden state of the GRU decoder.

Topic Encoder as a Discriminator
Although the holistic latent code contains semantic and structural features, the decoder may not fully leverage the semantic part of the code. Besides the reconstruction loss, which drives the generator to produce realistic sentences, we introduce a discriminator which forces the generator to produce texts whose topics are coherent with $z_t$. Specifically, we let the encoder in our topic modeling component act as the discriminator. It is expected that the topic distributions inferred from the generated texts are similar to the topic distributions of the original texts. However, if the outputs of the sequence decoder are discrete, it is impossible to propagate gradients from the discriminator through the discrete samples. We thus resort to the Gumbel-Softmax distribution as an approximation of the discrete samples.
In detail, at each step of the text generation process, we obtain the distribution of one word $p(x_i \mid z_s, z_t) = [\pi_1, \pi_2, \ldots, \pi_{|V|}]$ and approximate the samples from $p(x_i \mid z_s, z_t)$ by

$$\tilde{\pi}_i = \frac{\exp\left((\log \pi_i + g_i)/\tau\right)}{\sum_{j=1}^{|V|} \exp\left((\log \pi_j + g_j)/\tau\right)},$$

where $g_i$ and $g_j$ are samples from $\mathrm{Gumbel}(0, 1)$ and $\tau$ is the temperature. As training proceeds, $\tau$ approaches 0, yielding an increasingly peaked distribution that eventually emulates the discrete case. Thus, the word generated by the decoder at each step is represented as $\tilde{\pi}^{\top} W_v$, with $\tilde{\pi} = [\tilde{\pi}_1, \ldots, \tilde{\pi}_{|V|}]$, where $W_v \in \mathbb{R}^{|V| \times e}$ denotes the word embedding matrix. We feed the generated texts into the topic modeling encoder and expect to maximize the likelihood of the original topic distribution $z_t$. So the loss function of the discriminator is specified as follows.
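The relaxation described above can be sketched in plain Python. This is an illustration of the standard Gumbel-Softmax trick under stated assumptions, not our training code: the helper names are hypothetical, the word distribution is assumed to be strictly positive, and the soft word is taken to be the probability-weighted sum of the rows of the embedding matrix $W_v$.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via inverse transform sampling."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep u strictly inside (0, 1)
    return -math.log(-math.log(u))


def gumbel_softmax(probs, tau, rng):
    """Relaxed one-hot sample from a categorical distribution with probabilities probs."""
    logits = [(math.log(p) + sample_gumbel(rng)) / tau for p in probs]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_embedding(weights, embedding_matrix):
    """Weighted sum of embedding rows: a differentiable stand-in for a word lookup."""
    dim = len(embedding_matrix[0])
    return [sum(w * row[d] for w, row in zip(weights, embedding_matrix)) for d in range(dim)]


rng = random.Random(0)
word_probs = [0.7, 0.2, 0.1]                      # decoder output over |V| = 3 words
W_v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]        # |V| x e embedding matrix, e = 2
relaxed = gumbel_softmax(word_probs, tau=0.5, rng=rng)
soft_word = soft_embedding(relaxed, W_v)          # input to the topic modeling encoder
```

With a lower `tau` the relaxed sample concentrates on one vocabulary entry, so `soft_word` approaches the embedding of a single discrete word while gradients can still flow back to `word_probs`.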
Combining the three loss terms, we obtain the loss function of the model as Eq. 9, where $\lambda_D = 0.1$.
We learn the parameters of the topic modeling and sequence modeling components alternately in a multitask learning setting. To avoid the latent variable collapse problem, we add weights $\lambda_t$ and $\lambda_s$ to the KL-related terms in the loss function and let the weights increase slowly to 1 during training.

Experiments and Result Analyses
To evaluate the text generation ability, we conduct experiments from four perspectives. First, we evaluate the language modeling ability of the sequence modeling component and the topic coherence of the topic modeling component. Next, to evaluate the effectiveness of the learned latent codes of texts, we perform a semi-supervised classification task. Finally, we transfer the question-answering pairs in the Yahoo dataset to different topics and present the generated texts.

Dataset Description & Experiment Setup
We conduct experiments on five benchmark text datasets: APNEWS, IMDB (Maas et al., 2011), BNC (BNC Consortium, 2007), Yahoo Answer (Yahoo) and Yelp15. We randomly sample 100k samples as training data and 10k each as validation and testing data. For all datasets, we first tokenize the texts using Stanford CoreNLP (Manning et al., 2014). Then we lowercase all word tokens and filter out words that occur fewer than 10 times. For Yahoo and Yelp15, we truncate the vocabulary to 20k words for fast training. For the bag-of-words input in the topic modeling component, we further remove stop words and exclude the top 0.1% most frequent words as well as words that appear in fewer than 100 documents. Table 1 shows summary statistics of all datasets. We fix the maximum sequence length to 50 for the texts in APNEWS, IMDB, and BNC, and to 150 for Yahoo and Yelp15. The 300-dimensional word embeddings are shared by the two components of our model. For the topic modeling component, we adopt a 2-layer MLP with 200 hidden units and ReLU as its activation function. We set the size of $z_t$ to 50. For the sequence modeling component, we adopt a bidirectional single-layer GRU with 600 hidden units (300 in each direction) as the encoder and a unidirectional GRU with 300 hidden units as the decoder. The size of $z_s$ is set to 20. We use a batch size of 32 and train the model for up to 40 epochs. Linear scheduling is used for KL annealing, and the weight grows from 0 at the beginning to 1 at 40k steps.
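The linear KL annealing schedule described above amounts to a one-line function. The sketch below is illustrative; the function name `kl_weight` and the assumption that the weight stays at 1 after the warm-up are ours.

```python
def kl_weight(step, warmup_steps=40_000):
    """Linear KL annealing: 0 at step 0, 1 at warmup_steps, and 1 afterwards."""
    return min(1.0, max(0.0, step / warmup_steps))


# Weight applied to the KL terms at a few training steps.
schedule = [kl_weight(s) for s in (0, 10_000, 20_000, 40_000, 80_000)]
```

Early in training the KL terms are almost switched off, so the decoder has to rely on the latent codes before the regularization toward the prior takes full effect.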

Sequence Modeling Evaluation
Through experiments on the five datasets, we evaluate the language modeling ability in terms of perplexity (PPL). We compare against several baselines, including models based on language models (i.e., LSTM LM, LSTM+LDA, Topic-RNN, TDLM) and models based on VAE (i.e., LSTM VAE, VAE+HF, TGVAE, DCNN-VAE, DVAE). LSTM LM is a plain language model implemented with an LSTM. LSTM+LDA concatenates the hidden states with the topic distribution learned by a pre-trained LDA. Different from ours, the LDA in LSTM+LDA is trained separately. Topic-RNN (Dieng et al., 2017) jointly learns an LDA with a language model and incorporates the topic distribution by a gate mechanism. TDLM (Lau et al., 2017) incorporates a convolutional topic model and also leverages the topic distribution in the same way that LSTM+LDA does. LSTM VAE is a standard VAE whose encoder and decoder are implemented by two LSTMs, respectively. VAE+HF (Wang et al., 2019) is a VAE with a mixture-of-Gaussians prior with Householder Flow. TGVAE (Wang et al., 2019) is a VAE guided by a Gaussian mixture distribution as prior with a jointly trained LDA. DCNN-VAE is a VAE using a dilated CNN as its decoder. DVAE (Xiao et al., 2018) uses a Dirichlet latent variable to improve the VAE. Besides the model we propose, we also evaluate our model without the discriminator (i.e., Ours w/o Dis).
The perplexity of VAE-based models is estimated approximately from the ELBO, which comprises a reconstruction term and a KL term. Besides the perplexity, we report the KL term for the VAE-based models. For our model, we report the KL values of the sequence modeling component and the topic modeling component in the first and second rows, respectively.
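As a rough illustration of how a perplexity can be estimated from the ELBO, the sketch below exponentiates the negative ELBO (reconstruction negative log-likelihood plus KL) averaged per token. The function name and interface are hypothetical, and the exact aggregation used for the reported numbers (for instance, which KL terms are included) is an assumption here.

```python
import math


def elbo_perplexity(recon_nll, kl, num_tokens):
    """Upper bound on perplexity: exp of the per-token negative ELBO over a corpus.

    recon_nll, kl, and num_tokens hold one entry per document (values in nats).
    """
    total_neg_elbo = sum(r + k for r, k in zip(recon_nll, kl))
    return math.exp(total_neg_elbo / sum(num_tokens))


# Two documents with 10 and 20 tokens.
ppl = elbo_perplexity(recon_nll=[40.0, 85.0], kl=[2.0, 3.0], num_tokens=[10, 20])
```

Because the ELBO is a lower bound on the log-likelihood, the value obtained this way is an upper bound on the true perplexity, and a larger KL term raises it unless the reconstruction term improves correspondingly.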
For a fair comparison, the compared results are taken from the models with 50 topics. The results are shown in Table 2. We find that our model and our model without the discriminator occupy the top two positions. We attribute the improvements to the decoupling of semantic and structural features. The results also verify that the discriminator in our model helps decrease the perplexity. In addition, the KL values of the topic modeling component are much larger than those of the sequential one. One possible reason is that the topic information reveals much of the diversity of texts.

Topic Coherence Evaluation
Topic models are traditionally evaluated using perplexity. However, Chang et al. (2009) show that perplexity does not correlate with the coherence of the generated topics. We adopt normalized PMI (NPMI) to evaluate topic coherence, following Lau et al. (2017). Given the top-n words of a topic, coherence is computed based on the sum of pairwise NPMI scores between topic words. We average topic coherence over the top 5/10/15/20 topic words. To aggregate topic coherence scores, we calculate the mean coherence over topics. In the experiments, the number of topics is fixed to 50 for all baselines. From Table 3, we find that the discriminator gives little improvement. This is because the role of the discriminator is to take the topic distributions as supervision signals to improve the generation, so that the sequence decoder can generate more topic-relevant texts. That is, the discriminator does not improve the topic modeling component itself. Besides the topic coherence values, to understand the topics concretely, we also provide the top five topic words from eight randomly chosen topics on each dataset in the supplementary material.
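For reference, pairwise NPMI and the resulting topic coherence can be computed as in the sketch below. It assumes document-level co-occurrence counts over a reference corpus given as a list of word sets; the conventions for word pairs that never co-occur (score of -1) and for pairs that co-occur in every document (score of 0) are our own choices for this illustration, and the function names are hypothetical.

```python
import math
from itertools import combinations


def npmi(w1, w2, docs):
    """Normalized PMI of two words from document-level co-occurrence."""
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    if p12 == 0.0:
        return -1.0  # convention: the pair never co-occurs
    if p12 == 1.0:
        return 0.0   # convention: degenerate case, the normalizer is zero
    return math.log(p12 / (p1 * p2)) / -math.log(p12)


def topic_coherence(top_words, docs):
    """Sum of pairwise NPMI scores over the top words of one topic."""
    return sum(npmi(a, b, docs) for a, b in combinations(top_words, 2))


corpus = [{"game", "team"}, {"game", "team"}, {"stock"}, {"stock"}]
coherent = topic_coherence(["game", "team"], corpus)
mixed = topic_coherence(["game", "team", "stock"], corpus)
```

Words that always appear together score close to 1, while mixing in a word from an unrelated set of documents pulls the coherence of the topic down.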

Semi-supervised Classification
To evaluate whether the latent codes incorporate text features, we perform a semi-supervised classification task and compare our model with the other models. To make a comprehensive comparison, we use Yahoo and Yelp15 as well as 20NEWS for text classification. Here, 20NEWS is a collection of forum-like messages from 20 newsgroup categories.
For any model to be evaluated, we first train it on the training documents of the dataset and then run the trained model to obtain the latent codes of all documents in the dataset. Next, we sample 2,000 documents from the training data and train a 2-layer softmax classifier using these documents and their category labels. The accuracy on the testing set of documents is shown in Table 4. Since the latent code in TATGM comprises two latent variables, we explore the accuracy of the two latent variables both separately and jointly.
As shown in Table 4, our model which combines the two latent variables achieves the highest accuracy. For 20NEWS and Yahoo, the model using only the topic latent variable is better than the one using only the structural latent variable. Since these two datasets are labeled by topic, the combination improves little. For the Yelp15 dataset, the performances of our two latent variables are similar, whereas the combination improves accuracy by up to 15%. The reason may be that the Yelp15 dataset is labeled by sentiment, which is determined not only by single terms but also by n-grams. For example, "good" and "not good" represent opposite sentiments. Our sequence modeling component can capture this kind of feature more accurately.

Topic Transfer between Question-answering Pairs
In our model, each dimension of the latent variable in the topic modeling component corresponds to a topic; therefore, we can manipulate the variable manually to verify whether the generated text can express a given topic while retaining the same structure. Specifically, given a document and its latent code, we change the topic part of the code and keep the structural part unchanged. Then we can check the text that the decoder generates from the whole latent code. We conduct the experiments on the Yahoo dataset, where each item contains one question-answering pair and one label corresponding to its topic. At first, we treat each question as a single sentence to check whether the generated texts satisfy our assumption. From the second column of Table 5, we find that the generated text can express a different topic while maintaining the original question structure.
We further try to transfer a question-answering pair from its original topic to target topics. We find that the model not only transfers the topic of the two sentences but also produces reasonable question-answering pairs, i.e., the new answer is meaningful for the new question. This is helpful for automatic question-answering scenarios. Table 5 shows an example of topic transfer. The first row lists the original question-answering pair, which is in the topic of Society&Culture. The remaining rows show the generated question-answering pairs when we change the original topic to a target topic while keeping the structural latent variable unmodified. Specifically, we replace the original topic distribution with a new distribution in which the dimension of the target topic is 1 and the others are 0. From the results, we can observe that when we change the topic to another one, the generated questions retain almost the same structure as the original ones. Moreover, the generated answers are also transferred to the target topic, and together with the generated questions they compose reasonable question-answering pairs. More examples are shown in the supplementary material.
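The manipulation of the latent code described above is simple enough to state as code. The sketch below only builds the modified code; the function name is hypothetical, the toy dimensions are far smaller than the sizes used in our experiments, and decoding the result with the GRU decoder is left out.

```python
def transfer_topic(z_t, z_s, target_topic):
    """Replace the topic part of the latent code with a one-hot target topic.

    The structural part z_s is kept unchanged, and the result is the
    concatenation [one_hot(target_topic); z_s] used to initialize the decoder.
    """
    one_hot = [1.0 if k == target_topic else 0.0 for k in range(len(z_t))]
    return one_hot + list(z_s)


z_t = [0.7, 0.2, 0.1]   # original topic distribution (K = 3 for illustration)
z_s = [0.5, -1.0]       # structural latent variable (2 dimensions for illustration)
z_new = transfer_topic(z_t, z_s, target_topic=2)
```

Only the first `len(z_t)` entries of the code differ from the original, which is why the generated text is expected to keep its structure while switching topic.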

Conclusion
In this paper, we present the text generation model TATGM. The model can learn semantic and structural features simultaneously. Moreover, the model employs a discriminator to ensure that the generated text is more coherent with the given topic information. Experimental results show that our model has a better text modeling ability than several state-of-the-art methods and learns disentangled latent representations for texts, which show superiority in a classification task. In particular, our model can generate meaningful question-answering pairs, which provides an alternative approach to transfer learning and helps to broaden the knowledge in other fields.