Short Text Topic Modeling with Topic Distribution Quantization and Negative Sampling Decoder

Topic models have prevailed for many years in discovering latent semantics from long documents. For short texts, however, they generally suffer from data sparsity because word co-occurrences are extremely limited, and thus tend to yield repetitive or trivial topics of low quality. In this paper, to address this issue, we propose a novel neural topic model in an autoencoding framework with a new topic distribution quantization approach that generates peakier distributions, which are more appropriate for modeling short texts. Beyond the encoding, to tackle this issue on the decoding side as well, we further propose a novel negative sampling decoder that learns from negative samples to avoid yielding repetitive topics. We observe that our model can greatly improve short text topic modeling performance. Through extensive experiments on real-world datasets, we demonstrate that our model outperforms both strong traditional and neural baselines under extreme data sparsity, producing high-quality topics.


Introduction
In addition to formal documents, short texts play an increasingly important role in the era of information explosion, where people instantly share ideas, feelings, and comments via short text fragments, including tweets, headlines, and product reviews. The latent semantics or topics discovered among these short texts can be utilized in many applications, such as content summarization (Ma et al., 2012), classification (Zeng et al., 2018a), and recommendation (Zeng et al., 2018b; Mehrotra et al., 2013). However, conventional topic models (Blei et al., 2003), which work reasonably well on various kinds of long documents, perform poorly on short texts. The main underlying reason is that the co-occurrence information in short texts is extremely limited, known as the data sparsity problem, which hinders topic models from learning effective semantics and high-quality topics in a purely unsupervised fashion. Therefore, several approaches have been proposed to alleviate this issue. One simple approach is to yield pseudo texts (Quan et al., 2015) so that conventional topic models can apply, e.g., by aggregating user data (Weng et al., 2010), hashtags (Mehrotra et al., 2013), or external corpora (Zuo et al., 2016), but such auxiliary information is not always available. In another vein, extra structural information or semantics are incorporated into the models.

Table 1: Examples of repetitive topics (about sports) and trivial topics discovered from short texts.
sports scores games soccer league tennis ncaa players football
sports tennis soccer hockey games football beach match players
sports match cup hockey olympic football players sport league
sports football sport league games tennis champions club
sports football league game tennis players hockey games scores
bad additional abstract aspectj behave displayed customise accept
abstract behave accept additional bad displayed customise
abstract accept behave additional adding long many administration
For instance, Biterm Topic Model (BTM) (Yan et al., 2013) directly constructs the topic distributions over unordered word-pairs (biterms); Generalized Pólya Urn-DMM (GPUDMM) (Li et al., 2016) applies auxiliary pre-trained word embeddings to introduce external information from other sources. However, the data sparsity problem of short texts remains to be solved, especially resulting in repetitive and trivial topics. For example, as illustrated in Table 1, we can see several repetitive topics about sports including repeated words like "football", "games", and "tennis", and trivial topics composed of incoherent words are discovered from short texts. These topics are of low quality and could impair the performance of downstream tasks.
In this paper, we aim to design a model that can generate high-quality topics from short texts and is more robust to rigorous data sparsity scenarios without any auxiliary corpus. Different from previous methods, we propose a new Negative sampling and Quantization Topic Model (NQTM) in an auto-encoding framework to address the unsupervised short text modeling problem, which includes two essential and novel components. First, for short texts, we need peakier topic distributions for decoding, since short texts cover few primary topics; for example, Dirichlet Multinomial Mixture (DMM) (Nigam et al., 2000; Yin and Wang, 2014) assumes each short text covers only one topic. In the auto-encoding framework, a possible and straightforward way is to use Gumbel-Softmax (Jang et al., 2016), but its performance is highly determined by the temperature parameter, which needs to be tuned across topic numbers and corpora; therefore, it may not guarantee high-quality topics. Another way is to quantize the latent representations as in VQ-VAE (van den Oord and Vinyals, 2017). Unfortunately, the original quantization of VQ-VAE is designed for image generation and cannot produce peakier distributions for short text topic modeling. Therefore, we propose a novel topic distribution quantization for short texts that separably maps topic distributions into an appropriately defined embedding space. With this new method, our model naturally encourages discretization to flexibly yield peakier distributions for decoding, resulting in much better topic quality.
Second, we propose a new negative sampling decoder to improve topic diversity. As mentioned previously, short texts are extremely sparse inputs, so the learning signals are too weak to converge to a good local minimum, notably in an unsupervised learning fashion, leading to repetitive topics. Therefore, instead of using a straightforward log-likelihood objective, we propose a negative sampling decoder that performs reconstruction by selecting target words from assigned topics and negative words from topics that are unlikely to be assigned. It acts as an inductive bias that encourages the topic-word distributions to be pushed away from each other, resulting in a better learning objective for generating diverse topics. The main contributions of this paper can be summarized as follows:
• We propose a neural model with a novel topic distribution quantization method to produce peakier distributions for improving short text topic modeling;
• We also propose a negative sampling decoder to enhance the diversity of short text topics instead of conventional log-likelihood maximization;
• We conduct comprehensive experiments on real-world datasets and demonstrate that our model can effectively alleviate the data sparsity problem and generate higher-quality (more coherent and diverse) topics for short texts;
• We further discuss the trade-off between topic coherence and diversity for short text topic models in detail and show our model outperforms baselines on both aspects.
The code is available at https://github.com/bobxwu/NQTM.

Related Work
Conventional topic models Conventional probabilistic topic models, e.g., Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 1999) and Latent Dirichlet Allocation (LDA) (Blei et al., 2003), work very well on formal documents with long texts. To improve short text topic modeling, the Biterm Topic Model (BTM) (Yan et al., 2013) and the Dirichlet Multinomial Mixture (DMM) model (Nigam et al., 2000; Sadamitsu et al., 2007; Yin and Wang, 2014) are two basic probabilistic topic models for short texts, which employ traditional Bayesian inference methods including Gibbs Sampling (Steyvers and Griffiths, 2007) and Variational Inference (Blei et al., 2017). Several extensions of BTM and DMM have also been proposed, such as the Generalized Pólya Urn-DMM (GPUDMM) (Li et al., 2016) with word embeddings and the Multiterm Topic Model (Wu and Li, 2019). Besides, Semantics-assisted Non-negative Matrix Factorization (SeaNMF) (Shi et al., 2018) was lately proposed as an NMF topic model incorporating word-context semantic correlations, solved by a block coordinate descent algorithm.
Neural topic models More recently, deep neural networks have shown great potential for learning complicated distributions in unsupervised models. Due to the success of the Variational AutoEncoder (VAE) (Kingma and Welling, 2014; Rezende et al., 2014), various neural topic models have been proposed (Nan et al., 2019; Wu et al., 2020). The Neural Variational Document Model (NVDM) (Miao et al., 2016) is the first VAE-based neural topic model; it adopts the reparameterization trick for Gaussian distributions and achieves remarkable results on normal text topic modeling. Extensions such as the Gaussian Softmax Construction (GSM) have been explored by Miao et al. (2017). Product-of-experts LDA (ProdLDA) was proposed by Srivastava and Sutton (2017), using the Logistic Normal distribution due to the difficulty of applying the reparameterization trick to the Dirichlet distribution, which is important for topic modeling. The Topic Memory Network (TMN) (Zeng et al., 2018a) was proposed for supervised short text topic modeling and classification with pre-trained word embeddings, combining a neural topic model (Miao et al., 2016) with memory networks (Weston et al., 2014). Different from these neural topic models, the proposed model aims to improve short text topic modeling without any extra information. Our model relies on the novel topic distribution quantization to discretize the latent representations in an auto-encoding framework instead of the VAE assumption. Meanwhile, a new objective under the negative sampling decoder replaces the traditional log-likelihood maximization objective to alleviate the data sparsity of short texts in particular.
Negative sampling and Quantization Topic Model

A Brief Review of Topic Models
LDA (Blei et al., 2003) is one of the most classic probabilistic topic models. In its formulation, a topic is defined as a distribution over words, and each word in a text is drawn from a mixture of Multinomial distributions with a Dirichlet prior. In LDA, the latent variable z_i denotes the topic assignment of word x_i and θ is the topic distribution of a text. According to the generative procedure of LDA, the marginal likelihood of a text x is

p(x | α, β) = ∫ p(θ | α) ∏_{i=1}^{N} Σ_{z_i} p(x_i | β_{z_i}) p(z_i | θ) dθ,

where N refers to the number of words in text x, α is the hyperparameter of the Dirichlet distribution, β_{z_i} refers to the topic distribution over words given the topic assignment z_i, and β = (β_1, ..., β_K) ∈ R^{V×K} is the matrix of all topic-word probability vectors (V is the vocabulary size and K is the topic number). Then, approximation methods, like Variational Inference or Gibbs Sampling, are employed to approximate the intractable posterior.
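As a concrete illustration, the generative procedure above can be sketched in a few lines of NumPy. The corpus sizes and hyperparameter values below are illustrative assumptions, not settings from this paper, and β is stored here as a K × V matrix (transposed relative to the V × K convention above):

```python
import numpy as np

def generate_document(alpha, beta, n_words, rng):
    """Sample one document from the LDA generative process.

    alpha: Dirichlet hyperparameter over topics, shape (K,)
    beta:  topic-word distributions, shape (K, V); each row sums to 1
    """
    K, V = beta.shape
    theta = rng.dirichlet(alpha)        # per-document topic distribution
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)      # topic assignment for this word
        x = rng.choice(V, p=beta[z])    # word drawn from the assigned topic
        words.append(x)
    return theta, words

rng = np.random.default_rng(0)
alpha = np.full(3, 0.1)                    # sparse prior: few topics per text
beta = rng.dirichlet(np.ones(20), size=3)  # 3 toy topics over a 20-word vocabulary
theta, words = generate_document(alpha, beta, 10, rng)
```

With a small α, the sampled θ concentrates on few topics, which is exactly the behavior DMM-style models hard-code for short texts.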
In a different way, with the help of neural variational inference, neural topic models (Miao et al., 2017; Srivastava and Sutton, 2017) have been proposed to simplify the inference, so that the model can be directly updated by gradient backpropagation. These models adopt a simplification in which the discrete latent variable z is integrated out in the marginal likelihood as

p(x | α, β) = ∫ p(θ | α) ∏_{i=1}^{N} p(x_i | β, θ) dθ.   (1)

Based on these preceding neural topic models, we present our proposed model for short text topic modeling.

Network Architecture
In this section, we detail the proposed Negative sampling and Quantization Topic Model (NQTM). Figure 1 shows the overall architecture including three main parts.

Short Text Encoder
Topic models discover semantic information (topics) in large unlabeled datasets using word co-occurrence, so they typically adopt the bag-of-words assumption and ignore word order for simplification. Thus, we adopt MLPs, which are sufficient for both the encoder and the decoder. We assume the short text x is in bag-of-words form and is turned into a continuous representation by the short text encoder. We adopt the following simple network structure as our short text encoder:

π_1 = ζ(W_1 x),
π_2 = ζ(W_2 π_1),
θ_e = σ(π_2),

where W_1 and W_2 are linear transformations, π_1 and π_2 are intermediate outputs, σ(·) denotes the softmax function for normalization, and ζ(·) denotes the softplus function. After the encoder, we have the lower-dimensional representation θ_e of the short text x.
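A minimal NumPy sketch of such an encoder follows. The exact arrangement of the softplus and softmax layers, and the omission of biases, are assumptions based on the description above; the toy dimensions are also illustrative:

```python
import numpy as np

def softplus(v):
    return np.log1p(np.exp(v))          # zeta(.)

def softmax(v):
    e = np.exp(v - v.max())             # sigma(.), shifted for numerical stability
    return e / e.sum()

def encode(x, W1, W2):
    """Map a bag-of-words vector x to a topic distribution theta_e."""
    pi1 = softplus(W1 @ x)              # first intermediate output
    pi2 = softplus(W2 @ pi1)            # second intermediate output
    return softmax(pi2)                 # normalize into a distribution over topics

rng = np.random.default_rng(0)
V, H, K = 30, 16, 5                     # vocabulary, hidden, and topic sizes (toy values)
W1 = 0.1 * rng.normal(size=(H, V))
W2 = 0.1 * rng.normal(size=(K, H))
x = np.zeros(V)
x[[2, 7, 11]] = [1.0, 2.0, 1.0]         # toy word counts for one short text
theta_e = encode(x, W1, W2)
```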

Topic Distribution Quantization
Instead of directly feeding the continuous representation θ_e to the decoder as in previous neural topic models (Miao et al., 2016, 2017; Srivastava and Sutton, 2017), we apply a quantization step first. Unfortunately, we find that directly using the original quantization of VQ-VAE cannot produce the peakier distributions needed for short texts. To this end, we propose a novel topic distribution quantization method to alleviate the data sparsity problem of short texts in particular. We first set a discrete embedding space e = (e_1, e_2, ..., e_B) ∈ R^{K×B}, where B is the size of the embedding space. To maximize the distances between embedding vectors and obtain peakier topic distributions, the first K vectors (e_1 ... e_K) are initialized with the identity matrix, and the remaining vectors (e_{K+1} ... e_B) are initialized with uniform unit scaling, Uniform(−√(3/K), √(3/K)). Therefore, the embedding space e can be written as

e = (I_K, e_{K+1}, ..., e_B),

which can be seen as an extended identity matrix. The continuous representation θ_e is mapped to the nearest vector θ_q in the embedding space e as

θ_q = e_k, where k = argmin_j ||θ_e − e_j||_2.

In this way, the proposed quantization method for short texts makes the latent representations map separably to distinct embedding vectors and flexibly generates peakier topic distributions, which helps our model tackle data sparsity and improve the diversity and coherence of topics.
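The construction of the embedding space and the nearest-vector mapping can be sketched as follows (NumPy; embedding vectors are stored as rows here, and the toy sizes K and B are assumptions):

```python
import numpy as np

def build_embedding_space(K, B, rng):
    """First K rows form the identity matrix; the remaining B-K rows are
    drawn from Uniform(-sqrt(3/K), sqrt(3/K))."""
    e = np.empty((B, K))
    e[:K] = np.eye(K)
    bound = np.sqrt(3.0 / K)
    e[K:] = rng.uniform(-bound, bound, size=(B - K, K))
    return e

def quantize(theta_e, e):
    """Map theta_e to its nearest embedding vector in Euclidean distance."""
    dists = np.linalg.norm(e - theta_e, axis=1)
    return e[np.argmin(dists)]

rng = np.random.default_rng(0)
K, B = 4, 6
e = build_embedding_space(K, B, rng)
theta_e = np.array([0.7, 0.1, 0.1, 0.1])   # soft distribution from the encoder
theta_q = quantize(theta_e, e)             # snapped to a single embedding vector
```

Because the first K rows are one-hot, encoder outputs that already favor one topic tend to snap to a one-hot (maximally peaky) distribution.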

Negative Sampling Decoder
After the topic distribution quantization, θ_q is fed to the decoder for reconstruction. It has been found that normalizing the topic-word probability matrix β, such as σ(β), results in trivial and less discriminative topics (Srivastava and Sutton, 2017). Hence, according to Equation (1), the reconstruction of a word x_i in the text x is modeled as

p(x_i | β, θ_q) = σ(βθ_q)_{x_i}.

Negative sampling algorithm In contrast to the standard decoder with a log-likelihood maximization objective, we propose to take advantage of the negative sampling scheme and formulate a new decoder to generate more diverse topics. Similar ideas appear in other data-sparse settings such as collaborative filtering (Liang et al., 2018), where, for a short text, the negative samples are simply all the words that do not appear in it. But this method cannot explicitly distinguish the words of different topics.
Thus, instead of applying this simple solution, we further propose the negative sampling decoder. We take the words with high probabilities in other topics that are not assigned to the current text fragment as negative samples. The intuition is to strengthen the discrimination between words drawn from the assigned topic distribution and negative draws from other topics that are not assigned to the text. Therefore, we introduce an inductive bias that prompts the topic-word distributions to be pushed away from each other. In the meantime, the neural model benefits from a better learning signal than the ordinary softmax loss. As shown in Figure 1, given a short document and its topic distribution, we first remove the top t most probable topics and sample one negative topic z_neg from the remaining (K − t) topics with equal probability, i.e., z_neg ∼ Mult(p, 1), where p = (p_1, p_2, ..., p_K) and p_k, the probability of choosing topic k, is defined as

p_k = 1/(K − t) if topic k is not among the top t topics, and p_k = 0 otherwise.

Therefore, z_neg represents a topic that the document is unlikely to cover because of its low probability of being assigned. Then, we generate M words from β_{z_neg} with a TopK function as

x_neg = TopK(β_{z_neg}, M),

where x_neg denotes the M words that topic z_neg is most likely to contain. Since the document is supposed not to cover z_neg, the decoder should avoid generating these words during reconstruction. This heuristic acts as a positive bias that helps the model discover high-quality topics, and the negative samples x_neg amplify the learning signals, better optimizing the neural model and improving topic diversity.
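The sampling procedure above can be sketched as follows (NumPy; the values of t and M and the toy distributions are illustrative assumptions, and β is stored as K × V):

```python
import numpy as np

def sample_negative_topic(theta_q, t, rng):
    """Zero out the top-t topics of theta_q, then sample uniformly from the rest."""
    K = theta_q.shape[0]
    top_t = np.argsort(theta_q)[-t:]        # indices of the t most probable topics
    p = np.full(K, 1.0 / (K - t))
    p[top_t] = 0.0                           # these topics can never be sampled
    return rng.choice(K, p=p)

def top_k_words(beta, z_neg, M):
    """TopK: indices of the M most probable words of topic z_neg."""
    return np.argsort(beta[z_neg])[-M:][::-1]

rng = np.random.default_rng(0)
K, V, t, M = 5, 12, 2, 3
theta_q = np.array([0.05, 0.6, 0.2, 0.1, 0.05])  # topics 1 and 2 are the top-2
beta = rng.dirichlet(np.ones(V), size=K)         # toy topic-word distributions
z_neg = sample_negative_topic(theta_q, t, rng)   # drawn from topics {0, 3, 4}
x_neg = top_k_words(beta, z_neg, M)              # words the decoder should avoid
```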
Objective function With the negative sampling decoder, we can then construct our objective function. The reconstruction error and the negative sampling error are

L_rec^(i) = −(x^(i))^T log σ(βθ_q^(i)),
L_neg^(i) = −(x_neg^(i))^T log(1 − σ(βθ_q^(i))),

where x^(i) refers to the i-th short text in the corpus. As indicated previously, θ_e^(i) is the latent representation output by the encoder for x^(i), and θ_q^(i) is the discrete representation after the topic distribution quantization. We apply the cross-entropy between the input x^(i) and σ(βθ_q^(i)) to calculate the reconstruction error. For the negative sampling error, we also use the cross-entropy, between x_neg^(i) and (1 − σ(βθ_q^(i))), to enrich the learning signals. Therefore, the overall training objective with the negative sampling decoder can be written as

min_Θ L(Θ) = Σ_{i=1}^{D} ( L_rec^(i) + L_neg^(i) + ||sg(θ_e^(i)) − θ_q^(i)||_2^2 + λ ||θ_e^(i) − sg(θ_q^(i))||_2^2 ),

where Θ denotes all parameters and D is the number of texts in the corpus. In order to minimize the distance between the embedding vector θ_q^(i) and the encoder output θ_e^(i), the training objective includes the ℓ2 regularization between them. In detail, λ is a hyperparameter, and the sg(·) operator denotes the stop-gradient operation, defined as

sg(x) = x in the forward pass and 0 in the backward pass,

which blocks gradients from flowing into its argument. The above is the architecture of our proposed model NQTM; moreover, we name a simple variant of NQTM without the negative sampling error L_neg as the Quantization Topic Model (QTM). From the above description, our model NQTM differs from VQ-VAE in two aspects. First, instead of a standard decoder, our model includes the new negative sampling decoder. Second, a novel topic distribution quantization method is proposed, particularly for short texts, to yield sharper distributions. Both approaches alleviate the data sparsity issue, and we demonstrate the effectiveness of these two technical contributions in the next sections.

Experiment Settings

Datasets
We conduct experiments on the following real-world short text datasets:
• StackOverflow This dataset is from the challenge data published on Kaggle. We use the dataset containing 20,000 randomly selected question titles provided by Xu et al. (2015).
Each question title is annotated with an information technology name like "matlab", "osx" and "visual studio" as labels.
• TagMyNews Title This dataset contains titles and contents of English news released by Vitale et al. (2012). We utilize the news titles as short texts in our experiments. Each news item is assigned a ground-truth label, e.g., "scitech" and "business".
• Snippet This dataset, provided by Phan et al. (2008), is composed of web content from Google search snippets. Eight labels are included, such as "Culture-Arts-Entertainment" and "Computers".
• Yahoo Answer We obtained this dataset from Zhang et al. (2015) through the Yahoo Webscope program; it includes question titles, contents, and best answers. We adopt the question titles for topic modeling, with ten labels in total.
To preprocess the raw content, we conduct the following steps: (1) tokenize each text and remove non-Latin characters and stop words using NLTK; (2) filter out short texts with length less than 2; (3) remove words with document frequency less than 5; (4) convert all letters to lowercase. The statistics of each dataset after preprocessing are summarized in Table 2.
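A compact sketch of these preprocessing steps is shown below. It substitutes a simple Latin-letter regex for the NLTK tokenizer and applies the filters in a fixed order, both of which are assumptions about the exact pipeline:

```python
import re
from collections import Counter

def preprocess(texts, min_df=5, min_len=2, stop_words=frozenset()):
    """Tokenize, lowercase, drop stop words and rare words, remove tiny texts."""
    token_re = re.compile(r"[A-Za-z]+")      # keeps Latin-letter tokens only
    docs = [[w.lower() for w in token_re.findall(t) if w.lower() not in stop_words]
            for t in texts]
    # document frequency: in how many texts each word occurs
    df = Counter(w for d in docs for w in set(d))
    docs = [[w for w in d if df[w] >= min_df] for d in docs]
    return [d for d in docs if len(d) >= min_len]

corpus = ["Apple banana apple", "banana cherry!", "banana apple 123", "solo"]
docs = preprocess(corpus, min_df=2, min_len=2)
# "cherry" and "solo" fall below min_df, so the second and fourth texts become too short
```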

Baseline Models
We take both conventional and neural topic models as baselines for comparison. For traditional topic models, we consider LDA (Blei et al., 2003), BTM (Yan et al., 2013), DMM (Yin and Wang, 2014), GPUDMM (Li et al., 2016), and SeaNMF (Shi et al., 2018). Note that SeaNMF is the state-of-the-art conventional model. In terms of neural topic models, we compare our model with NVDM (Miao et al., 2016), GSM (Miao et al., 2017), and ProdLDA (Srivastava and Sutton, 2017). The recently proposed supervised model TMN (Zeng et al., 2018a) is also taken into consideration. We also compare our model with VQ-VAE to demonstrate the effectiveness of our proposed topic distribution quantization method.

Topic Quality Evaluation
Topic Quality Metrics As mentioned before, the data sparsity challenge in short texts results in two problems: generated topic words tend to be incoherent (trivial topics), and highly similar topics with repeated words are also yielded (repetitive topics). Therefore, we focus on evaluating topic quality with respect to these two aspects: topic coherence and topic diversity. Topic coherence metrics rely on co-occurrences of the learned topic words in an external corpus, assuming that coherent words should co-occur within a certain distance. A new topic coherence metric, C_V, was introduced by Röder et al. (2015), which has been shown to perform better than other coherence metrics such as the widely used NPMI (Bouma, 2009; Newman et al., 2010; Chang et al., 2009) and UMass (Mimno et al., 2011). Following Krasnashchok and Jouili (2018), given a topic z and its top T words (x_1, x_2, ..., x_T) sorted by probability, C_V is defined as

C_V(z) = (1/T) Σ_{i=1}^{T} s_cos(u_i, w),

where s_cos(·) is the cosine similarity function and the vectors are defined as

u_i = (NPMI(x_i, x_1), ..., NPMI(x_i, x_T)),
w = Σ_{i=1}^{T} u_i.

Then, the NPMI is calculated as

NPMI(x_i, x_j) = log( (p(x_i, x_j) + ε) / (p(x_i) p(x_j)) ) / ( −log(p(x_i, x_j) + ε) ),

where p(x_i) is the probability of x_i, p(x_i, x_j) is the co-occurrence probability of x_i and x_j within a window in the reference corpus, and ε is used to avoid zero. We use the public Palmetto tool (https://github.com/dice-group/Palmetto) provided by Röder et al. (2015) to compute C_V. Besides the C_V score, we employ the topic uniqueness metric (T U) (Nan et al., 2019) for topic diversity evaluation. For the top T words of topic z, it is defined as

T U(z) = (1/T) Σ_{i=1}^{T} 1 / cnt(x_i),

where cnt(x_i) is the total number of times that word x_i appears in the top T words of all topics. The T U score therefore ranges from 1/K to 1, and a higher value means the generated topics are more diverse, with fewer duplicated words across topics.
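Since C_V requires word statistics from an external reference corpus (e.g., via the Palmetto tool), only the T U metric is sketched here; the example topics are hypothetical:

```python
from collections import Counter

def topic_uniqueness(topics):
    """TU(z) = (1/T) * sum_i 1/cnt(x_i), where cnt(x_i) counts how many times
    word x_i appears across the top-T word lists of all topics."""
    cnt = Counter(w for topic in topics for w in topic)
    T = len(topics[0])
    return [sum(1.0 / cnt[w] for w in topic) / T for topic in topics]

topics = [["football", "games", "tennis"],   # shares two words with the next topic
          ["football", "games", "soccer"],
          ["mac", "osx", "installer"]]       # fully unique words
tu = topic_uniqueness(topics)                # -> [2/3, 2/3, 1.0]
```

Repetitive topics drag T U toward its lower bound 1/K, while a topic with no repeated words scores exactly 1.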
It is crucial to note that there is generally a trade-off between the two metrics: pushing T U higher tends to lower C V, since forcing topics to use distinct words can sacrifice coherence, while pushing C V higher tends to lower T U, since highly coherent words frequently repeat across topics. We show in the following that our model achieves significantly better performance on both aspects.

Table 3 reports the topic coherence (C V) and uniqueness (T U) scores of the top 15 words under topic numbers K = 20 and 50. More specifically, when K = 20, NQTM achieves significantly higher C V scores, and its T U scores are the highest on all datasets. When K = 50, NQTM still surpasses all unsupervised baselines on StackOverflow and TagMyNews Title in terms of both T U and C V scores. Although the C V scores of ProdLDA and BTM are higher on Snippet and Yahoo Answer, the T U scores of NQTM are much better. As mentioned earlier, the reason is that C V scores can easily be inflated by repetitive topics composed of prominent words despite low topic diversity (further illustrated in Section 5.4). This issue is even more severe for TMN. Notably, the T U scores of TMN are among the worst of all baselines, because the diversity of the topics learned by TMN is not encouraged by the strong learning signal from the classification loss. Although some topics discovered by the above baselines seem coherent, many repetitive and less informative topics are ineffective in downstream applications; thus, their higher C V scores are misleading. On the contrary, the topic diversity performance of NQTM is clearly superior while maintaining high coherence at the same time, which demonstrates the effectiveness of our model in alleviating the data sparsity problem.

Ablation Study
To conduct an ablation study, we also compare NQTM with VQ-VAE and QTM in Table 3. We notice that VQ-VAE sometimes has higher C V scores, but as indicated in Section 5.1, this is not meaningful because of its much lower T U scores. In contrast, QTM clearly has higher T U scores than VQ-VAE. This is because our new topic distribution quantization can separably distinguish the topic distributions of different topics, while VQ-VAE cannot, causing massive numbers of texts with different topics to map to the same embedding vector. This contrast shows the effectiveness of our new topic distribution quantization method. Moreover, compared to QTM, NQTM performs comparatively better on C V scores and achieves obvious improvements in T U scores. This is because our negative sampling decoder provides extra learning signals that encourage the topic-word distributions to differ from each other, bringing about better topic diversity. The change of the T U scores of QTM and NQTM along training epochs is shown in Figure 3, which illustrates that the T U score of NQTM gradually becomes higher than that of QTM during training. It is worth noting that one advantage of QTM over NQTM is that QTM is faster to train, since the negative sampling error is not required. From the above comparisons between VQ-VAE, QTM, and NQTM, we observe that our proposed topic distribution quantization and negative sampling decoder are both effective in improving the topic quality of short texts.

Data Sparsity Analysis
Since data sparsity is the essential challenge of short text topic modeling, to further demonstrate the advantages of our model, we explore topic coherence and diversity under varying degrees of data sparsity along two axes: the number of topics (K) and the minimum document frequency (min-df) used in preprocessing (see Section 4.1). Experimental results of NVDM, ProdLDA, and SeaNMF are reported, as these baselines perform relatively better among the neural and traditional topic models, respectively. Figures 2a and 2b show the C V and T U scores on StackOverflow with the topic number K ranging from 10 to 100. Although the T U scores of all models tend to decline due to the lack of word co-occurrences, NQTM declines much more slowly than the others by a large margin and also surpasses the other baseline models in terms of C V. Figures 2c and 2d present the C V and T U scores on StackOverflow preprocessed with different min-df values, from 0 to 10, under K = 50. Note that data sparsity becomes more severe when corpora are preprocessed with a larger min-df. We can see that NQTM maintains higher C V scores than the others; in particular, the T U scores of the baselines fall sharply while NQTM remains comparatively stable.
Based on the above results under various data sparsity conditions, we conclude that NQTM is consistently more robust in tackling the data sparsity challenge of short texts, which means NQTM can be better utilized in downstream applications.

Table 4: Topic word examples of each model.
DMM: able abort absolute abstract accept accepts / able abort absolute abstract accept accepts / wiki wikipedia encyclopedia film article movie / movie movies film com imdb news reviews / oscar academy movies movie picture winners
GPUDMM: qt library using matlab project use widget / mac os qt osx windows application using / oscar academy awards com movie winners award / movie film com movies news reviews films / movie movies imdb film title celebs encyclopedia
SeaNMF: cocoa window text menu button item focus / application cocoa context without getting running / oscar academy awards com winners award movie / movie film com movies news reviews films / movie movies imdb film title celebs encyclopedia
NVDM: featuring conducts homes hole creates aspects / hand hear serve spanning compliance topix / breakthrough continually rule progressive remedy ankle / yet dry gum pink interview added / lamp construct natural arrows width correct
ProdLDA: music romantic pop rock movie comedy movies / music movie romantic pop movies comedy movie / celebrity movies favorite youtube episode / intel duo athlon core parallel processor memory / intel processor memory cache ram pentium core
NQTM: mac os leopard snow installing osx installer / qt widget signal slot signals creator slots / cocoa interface builder events nsview app / movie movies character actor scripts actors / core intel processor pentium dual processors

Topic Examples Evaluation
To qualitatively illustrate the high-quality topics generated by our model, Table 4 presents examples of topic words yielded by DMM, GPUDMM, SeaNMF, NVDM, ProdLDA, and NQTM in one experiment. We observe that the baseline models generate repetitive topics with repeated words, such as "movie", "qt", and "processor", and although the topics of NVDM seem diverse, they are less informative. In contrast, NQTM generates a single coherent topic for each underlying theme, and the topic quality of NQTM is apparently higher. Moreover, when visualized, the latent representations learned by NQTM form distinct groups that are well separated, which is because NQTM generates peakier topic distributions for short text topic modeling. This discretization and separation of the latent space explains why NQTM achieves higher topic coherence and diversity.

Conclusion
In this paper, for short text topic modeling, we propose the Negative sampling and Quantization Topic Model (NQTM) with a novel topic distribution quantization mechanism that yields peakier distributions and a new negative sampling decoder that enriches the learning signals. Experiments on benchmark datasets quantitatively and qualitatively show that our model significantly outperforms baselines in overcoming the data sparsity problem of short texts. Future work could focus on employing the proposed model in more downstream tasks.