An Online Semantic-enhanced Dirichlet Model for Short Text Stream Clustering

Clustering short text streams is a challenging task due to their unique properties: infinite length, sparse data representation, and cluster evolution. Existing approaches often process short text streams in a batch way. However, determining the optimal batch size is usually difficult, since we have no prior knowledge of when topics evolve. In addition, the independent word representation used in traditional graphical models tends to cause the "term ambiguity" problem in short text clustering. Therefore, in this paper, we propose an Online Semantic-enhanced Dirichlet Model for short text stream clustering, called OSDM, which integrates word co-occurrence semantic information (i.e., context) into a new graphical model and clusters each arriving short text automatically in an online way. Extensive results demonstrate that OSDM performs better than many state-of-the-art algorithms on both synthetic and real-world data sets.


Introduction
A massive amount of short text data is constantly generated on online social platforms such as microblogs, Twitter and Facebook. Clustering such short text streams has thus gained increasing attention in recent years due to many real-world applications like event tracking, hot topic detection, and news recommendation (Hadifar et al., 2019). However, due to the unique properties of short text streams such as infinite length, evolving patterns and sparse data representation, short text stream clustering remains a big challenge (Aggarwal et al., 2003; Mahdiraji, 2009). *

*Corresponding author: Junming Shao
During the past decade, many approaches have been proposed to address the text stream clustering problem from different points of view, and each method comes with specific advantages and drawbacks. Initially, traditional clustering algorithms for static data were enhanced and transformed for text streams (Zhong, 2005). Very soon, they were replaced by model-based algorithms such as LDA (Blei et al., 2003), DTM (Blei and Lafferty, 2006), TDPM (Ahmed and Xing, 2008), GSDMM (Yin and Wang, 2016b), DPMFP (Huang et al., 2013), TM-LDA (Wang et al., 2012), NPMM (Chen et al., 2019) and MStream (Yin et al., 2018), to mention a few. However, most established approaches work in a batch way and assume that the instances within a batch are interchangeable. This assumption usually does not hold for topic-evolving text corpora. Determining an optimal batch size is also a non-trivial task for different text streams (Howard and Ruder, 2018).
Additionally, unlike long text documents, short text clustering further suffers from the lack of supportive term occurrences to capture semantics (Gong et al., 2018). Most existing short text clustering algorithms, such as Sumblr (Shou et al., 2013), DCT and MStreamF (Yin et al., 2018), exploit independent word representations in their cluster models, which tends to cause ambiguity. Consider the following four tweets, where T1 and T2 belong to the first topic (the fruit) and T3 and T4 to the second (the company): T1: "A regular intake of an apple can improve your health and muscle stamina." T2: "A glass of fresh apple juice is recommended for breakfast." T3: "New Apple Watch can monitor your health." T4: "Apple will launch new smartphone iPhoneX this December." The two topics share a few common terms, e.g., 'health' and 'apple'. This creates ambiguity if the model relies only on single-term representations to calculate similarity, whereas the co-occurring term representation (i.e., context) helps the model identify the first topic correctly.
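As a toy illustration of this ambiguity (not part of the proposed model), the following sketch compares single-term overlap with word-pair (context) overlap for the fruit tweet T1 and the gadget tweet T3. At the term level the tweets look related through 'apple' and 'health'; at the pair level, only the pairs built from those few shared words survive, while topic-specific pairs such as ('apple', 'intake') vs. ('apple', 'watch') differ.

```python
from itertools import combinations

def cooccurring_pairs(text):
    """Return the set of unordered word pairs co-occurring in one document."""
    words = sorted(set(text.lower().split()))
    return set(combinations(words, 2))

t1 = "A regular intake of an apple can improve your health and muscle stamina"
t3 = "New Apple Watch can monitor your health"

# Single-term overlap: both tweets contain 'apple' and 'health',
# even though they belong to different topics.
shared_terms = set(t1.lower().split()) & set(t3.lower().split())

# Pair-level context: the shared pairs are only those built from the
# few common terms, a small fraction of each tweet's own pair set.
shared_pairs = cooccurring_pairs(t1) & cooccurring_pairs(t3)
print(sorted(shared_terms), len(shared_pairs))
```

Relative to the 78 pairs in T1 and 21 pairs in T3, the handful of shared pairs carries much weaker evidence than the raw shared terms, which is the intuition behind weighting co-occurrence.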
To solve the aforementioned issues, we propose an online semantic-enhanced Dirichlet model for short text stream clustering. Compared to existing approaches, it has the following advantages. (1) It processes each arriving short text in an online way; the online model not only avoids determining an optimal batch size but also lends itself to handling large-scale data streams efficiently. (2) To the best of our knowledge, it is the first work to integrate semantic information into model-based online clustering, which handles the "term ambiguity" problem effectively and thus supports high-quality clustering. (3) Equipped with the Pólya urn scheme, the number of clusters (topics) is determined automatically in our cluster model.

Related Work
During the past decade, many text stream clustering algorithms have been proposed. Here, due to space limitations, we only report some model-based approaches that are highly related to our work. For more details, please refer to comprehensive surveys, e.g., (Mahdiraji, 2009; Silva et al., 2013; Nguyen et al., 2015; Aggarwal, 2018).
The early classical attempt for text clustering is Latent Dirichlet Allocation (LDA) (Blei et al., 2003). However, it cannot handle temporal data in text streams. For this purpose, many LDA variants have been proposed for text streams, such as the dynamic topic model (DTM) (Blei and Lafferty, 2006), the dynamic mixture model (DMM) (Wei et al., 2007), temporal LDA (T-LDA) (Wang et al., 2012), streaming LDA (S-LDA) (Amoualian et al., 2016), and the Dirichlet mixture model with feature partition (DPMFP) (Zhao et al., 2016). These models assume that each document contains rich content, and thus they are not suitable for short text streams. Later, the Dirichlet multinomial mixture model-based dynamic clustering topic (DCT) model was designed to deal with short text streams by assigning each document a single topic¹. Very soon, GSDMM was proposed to extend DMM with collapsed Gibbs sampling to infer the number of clusters (Yin and Wang, 2014). However, most of these models did not investigate evolving topics (clusters) in text streams, where the number of topics usually evolves over time.

¹Topic and cluster will be used interchangeably in this paper.
To automatically detect the number of clusters, (Ahmed and Xing, 2008) proposed the temporal Dirichlet process mixture model (TDPM). It divides the text stream into many chunks (batches) and assumes that the documents inside each batch are interchangeable. Later, GSDPMM was proposed with collapsed Gibbs sampling to infer the number of clusters in each batch. In contrast to LDA, GSDPMM not only converges faster but also dynamically adjusts the number of clusters over time (Yin and Wang, 2016a). However, both TDPM and GSDPMM do not examine evolving topics, and both process the text stream multiple times. Thereafter, MStreamF (Yin et al., 2018) was proposed, incorporating a forgetting mechanism to cope with cluster evolution and processing each batch only once. The NPMM model (Chen et al., 2019) was recently introduced, using word embeddings to eliminate a cluster-generating parameter of the model.
In summary, most existing approaches work in a batch way. However, determining optimal batch sizes for different text streams is usually difficult. More importantly, due to the intrinsically sparse representation of short text data, semantics is little investigated in established approaches, although it needs to be carefully considered to decrease term ambiguity in short text clustering.

Preliminaries
Here, the problem statement is first given, followed by a brief introduction to the Dirichlet process and the Pólya urn scheme.

Problem Formulation
Formally, a text stream is a continuous arrival of text documents over time, $S = \{d_t\}_{t=1}^{\infty}$, where $d_t$ denotes a document arriving at time $t$. Each document contains specific words, $d_t = \{w_1, w_2, \ldots, w_n\}$, and documents may have different lengths. The key objective of the clustering task is to group similar documents into clusters $Z = \{z_t\}_{t=1}^{\infty}$, where each cluster $z_t$ contains documents $z_t = \{d_1^{z_t}, d_2^{z_t}, \ldots, d_n^{z_t}\}$. For short text clustering, each document is a member of only one topic, so $z_i \cap z_j = \emptyset$ for $i \neq j$.

Dirichlet Process
The Dirichlet Process (DP) is a non-parametric stochastic process for modeling data (Teh et al., 2006). It draws samples from a base distribution, where each sample is itself a distribution, denoted $N \sim DP(\alpha, N_0)$. Here, $N$ is the sample drawn from the base distribution $N_0$, and the drawing procedure is controlled by a concentration parameter $\alpha$.

Pólya Urn Scheme (PUS)
The procedure for drawing sequential samples $N_1, N_2, \ldots$ from a distribution is described by the Pólya urn scheme (Blackwell et al., 1973). It can be summarized as follows. Initially, the urn is empty, so we draw a color from the base distribution, i.e., $N_1 \sim N_0$, and put a ball of the drawn color into the urn. In each subsequent turn, we either draw a color that has already been drawn, with probability $\frac{n-1}{\alpha+n-1}$, or draw a new color from $N_0$, with probability $\frac{\alpha}{\alpha+n-1}$. Since draws are repeated, the same color may appear more than once, so after $n$ draws we have $K \leq n$ distinct colors. This behavior is captured by the well-known Chinese restaurant process (CRP) (Ferguson, 1973). In the CRP, we suppose that a restaurant has an infinite number of tables, each surrounded by an infinite number of empty chairs. The first customer sits at the first table, and each subsequent customer either chooses an occupied table $k$ with probability $\frac{n_k}{\alpha+n-1}$ or chooses an empty table with probability $\frac{\alpha}{\alpha+n-1}$, where $n_k$ is the number of customers sitting at table $k$.
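The seating probabilities above can be simulated directly. The following sketch (an illustration, not part of the proposed model) draws customers one by one; existing tables are chosen in proportion to their occupancy $n_k$ and a new table in proportion to $\alpha$.

```python
import random

def chinese_restaurant_process(n_customers, alpha, rng):
    """Simulate table assignments under the CRP with concentration alpha."""
    tables = []                       # tables[k] = number of customers at table k
    assignments = []
    for _ in range(n_customers):
        # Existing table k is chosen with probability n_k / (alpha + n),
        # a new table with probability alpha / (alpha + n), where n is
        # the number of customers seated so far.
        weights = tables + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)          # open a new table
        else:
            tables[k] += 1            # join a crowded table (rich get richer)
        assignments.append(k)
    return tables, assignments

rng = random.Random(0)
tables, assignments = chinese_restaurant_process(1000, alpha=2.0, rng=rng)
print(len(tables), sum(tables))      # distinct tables vs. total customers
```

With 1000 customers and a small $\alpha$, only a handful of tables emerge, and the largest tables keep attracting most newcomers, which is exactly the clustering behavior exploited later.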
A new customer thus tends to be attracted to a highly crowded table. This rich-get-richer phenomenon is one part of our equation for understanding the creation of clusters over time. The CRP represents the draws from the distribution $G$, while the stick-breaking process shows the properties of $G$ explicitly: the mixture weights $\theta = \{\theta_k\}_{k=1}^{\infty}$ can be formalized by $\theta \sim GEM(\gamma)$ (Neal, 2000). We exploit Equation (1) for the generative process of the Dirichlet process multinomial mixture model (DPMM) as follows.
Here, $z_d$ is the cluster assignment of document $d$, which is multinomially distributed. The probability of document $d$ being generated by topic $z$ is summarized in Equation (2). Here, the naive Bayes assumption is adopted: words in a document are generated independently by the topic, and the positions of words in a document are not considered when calculating the probability. The sequential draw of samples can be derived by following the CRP.

Proposed Approach
This section discusses the representation and formulation of the proposed algorithm.

Model Representation
We build our model upon DPMM (Yin and Wang, 2016a), an extension of the DMM model that deals with evolving clusters. We call our model OSDM (Online Semantic-enhanced Dirichlet Model); it aims to incorporate semantic information and cluster evolution simultaneously for short text stream clustering in an online way. The graphical model of OSDM is given in Figure 1a.
We highlight two major differences that constitute the novelty of our model. First, for the word-topic distribution, we embed semantic information by capturing the ratio of word co-occurrence; thereby, both the independent word generating process and the word co-occurrence weight are considered in topic generation. Second, our model works instance by instance, processing each arriving document in an online way rather than in batches.

Model Formulation
Defining the relationship between documents and clusters is the most crucial task in text stream clustering. Threshold-based methodologies (Nguyen et al., 2015) adapt similarity measures to define a homogeneity threshold between a cluster and a document: if the dissimilarity between the existing clusters and a newly arriving document is above the threshold, a new cluster is created. However, due to the dynamic nature of the stream, it is very hard to define the similarity threshold manually.
In contrast, we assume that documents are generated by a DPMM (see Section 3). The most recent algorithm, MStreamF, improved DPMM to cluster short text documents in streams. As a further step, we integrate a semantic component into the DPMM model, and we additionally incorporate term importance based on cluster frequency. The derived equation for the probability of a document $d$ choosing an existing cluster $z$ is given in Equation (3).
The first term of this equation, $\frac{m_z}{D-1+\alpha D}$, represents the completeness of the cluster. Here, $m_z$ is the number of documents contained in cluster $z$, $D$ is the number of current documents in active clusters², and $\alpha$ is the concentration parameter of the model. The middle term, based on the multinomial distribution (see Equation (2)) with pseudo word weight $\beta$, defines the homogeneity between a cluster and a document. $N_d$ and $N_d^w$ represent the total number of words and the frequency of word $w$ in document $d$, respectively. $n_z^w$ is the frequency of word $w$ in cluster $z$, $V$ is the current vocabulary size of the model, and $n_z$ is the number of words in cluster $z$. $ICF_w$ captures the term importance over the active clusters in the model, defined as follows.
Here, $|Z|$ represents the number of active clusters in the model, and the denominator of Equation (4) is the number of clusters that contain the word $w$. The term $1 + \sum_{w_i \in d \wedge w_j \in d} cw_{ij}$ defines the semantic weight of term co-occurrence between the cluster and a document. Formally, we define the value of an entry $cw_{ij}$ in the co-occurrence matrix as follows.
Here, $n_z^d$ is the frequency count of word $w_i$ in document $d$. The ratio between $w_i$ and $w_j$ must satisfy the property $cw_{ij} + cw_{ji} = 1$. We calculate the term co-occurrence weight over those terms that are common to cluster $z$ and document $d$. An entry is added to the co-occurrence matrix only when two terms co-occur in a single document. Therefore, if the size of the cluster feature set (discussed in Section 4.3) is $|V_z|$, the co-occurrence matrix is not necessarily of size $|V_z| \times |V_z|$.
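A simplified sketch of the cluster-choice probability may help make the two factors concrete. The code below combines the completeness term $\frac{m_z}{D-1+\alpha D}$ with a smoothed multinomial homogeneity term; for brevity it omits the $ICF_w$ and co-occurrence weights, and the exact per-occurrence normalization follows the paper's Equation (3), so this is an approximation under those stated simplifications.

```python
from collections import Counter

def p_existing_cluster(doc_words, cluster, D, alpha, beta, V):
    """Simplified sketch of Eq. (3): completeness x homogeneity.

    cluster: dict with 'm' (document count), 'n_w' (word frequencies),
    'n' (total word count). ICF and co-occurrence weights are omitted.
    """
    # Completeness term m_z / (D - 1 + alpha * D): larger clusters
    # are more likely to attract the new document.
    p = cluster["m"] / (D - 1 + alpha * D)
    # Homogeneity term: each word occurrence is scored against the
    # cluster's smoothed word distribution (naive Bayes assumption).
    for w, n_w_d in Counter(doc_words).items():
        p *= ((cluster["n_w"].get(w, 0) + beta) / (cluster["n"] + V * beta)) ** n_w_d
    return p

# Hypothetical cluster summary for illustration.
cluster = {"m": 3, "n_w": {"apple": 4, "health": 2}, "n": 6}
score = p_existing_cluster(["apple", "juice"], cluster, D=10, alpha=0.3, beta=0.02, V=50)
print(score)
```

A word the cluster has never seen ('juice') contributes only the pseudo weight $\beta$, so documents sharing frequent cluster terms score much higher than unrelated ones.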
So far we have defined the probability of a document choosing an existing cluster; we must also define the probability of a document creating a new cluster. Following the DPMM with an infinite number of clusters, we transform $\theta \sim GEM(\gamma)$ into $\theta \sim GEM(\alpha D)$, because the hyper-parameter of the mixture model should change dynamically over time. Therefore, the probability of creating a new cluster is as follows.
Here, the pseudo number of cluster-related documents in the model is represented as $\alpha D$, and $\beta$ is the pseudo term frequency of each word (existing in the document) for the new cluster.

The cluster feature (CF) set
Similarity-based text clustering approaches usually follow the vector space model (VSM) to represent the cluster feature space (Din and Shao, 2020). However, a topic needs to be represented as a subspace of the global feature space. Here, we use a micro-cluster feature set to represent each cluster; namely, a cluster is represented by the summary statistics of the words of its related documents. In our model, a cluster feature (CF) set is defined as a 6-tuple $\{m_z, n_z^w, cw_z, len_z, l_z, u_z\}$, where $m_z$ is the number of documents in cluster $z$, $n_z^w$ is the frequency of word $w$ in the cluster, $cw_z$ is the word-to-word co-occurrence matrix, $len_z$ is the number of words in cluster $z$ (i.e., the sum of all word frequencies), $l_z$ is the cluster weight, and $u_z$ is the last-updated time stamp.
The desirable additive property of the cluster feature set allows updating each micro-cluster in an online way.
Definition 1: A document $d$ can be added to a cluster $z$ using the addition property:
$m_z = m_z + 1$
$n_z^w = n_z^w + N_d^w \;\; \forall w \in d$
$cw_z = cw_z \cup cw_d$
$len_z = len_z + len_d$
Here, $cw_d$ is the word-to-word co-occurrence of the document, and $len_d$ represents the total number of words in the document. The complexity of updating a cluster by adding a document is $O(L)$, where $L$ is the average length of a document. This property is useful for updating evolving micro-clusters in the text stream clustering procedure.

Algorithm 1: OSDM
Input: $S = \{d_t\}_{t=1}^{\infty}$, $\alpha$: concentration parameter, $\beta$: pseudo weight of a term in a cluster, $\lambda$: decay factor
Output: cluster assignments $z_d$
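The CF set and its addition property can be sketched as a small class; this is an illustrative data structure, not the paper's implementation, and the field names are hypothetical.

```python
from collections import Counter
from itertools import combinations

class ClusterFeature:
    """Micro-cluster summary {m_z, n_z^w, cw_z, len_z, l_z, u_z}."""

    def __init__(self, timestamp):
        self.m = 0                    # m_z: number of documents
        self.n_w = Counter()          # n_z^w: per-word frequency in the cluster
        self.cw = Counter()           # cw_z: word-pair co-occurrence counts
        self.length = 0               # len_z: total word count
        self.weight = 1.0             # l_z: cluster weight
        self.updated = timestamp      # u_z: last update time

    def add_document(self, words, timestamp):
        """Addition property: fold one document into the summary."""
        freq = Counter(words)
        self.m += 1
        self.n_w.update(freq)
        # Record each unique word pair co-occurring in this document.
        for wi, wj in combinations(sorted(freq), 2):
            self.cw[(wi, wj)] += 1
        self.length += len(words)
        self.updated = timestamp

cf = ClusterFeature(timestamp=0)
cf.add_document(["apple", "health", "apple"], timestamp=1)
print(cf.m, cf.n_w["apple"], cf.length)
```

Because every field is updated incrementally, an arriving document never requires revisiting previously seen documents, which is what makes the online update cheap.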

OSDM Algorithm
We propose a semantic-enhanced non-parametric Dirichlet model, called OSDM, to cluster short text streams in an online way. The proposed algorithm processes each instance incrementally and updates the model accordingly.
The procedure of OSDM is given in Algorithm 1. Initially, it creates a new cluster for the first document, and the document is assigned to the newly created CF set. Afterward, each arriving document in the stream either chooses an existing cluster or generates a new cluster. The corresponding probabilities of choosing an existing cluster or a new cluster are computed using Equations (3) and (6), respectively. The CF set with the highest probability is updated using the addition property.
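One online assignment step reduces to an argmax over the existing clusters and the new-cluster option. The sketch below abstracts the two scoring functions away (they stand in for Equations (3) and (6), which are supplied by the caller); the toy scores used in the example are hypothetical.

```python
def assign_document(doc_words, clusters, score_existing, score_new):
    """One online step of OSDM-style assignment.

    score_existing(doc_words, cf) stands in for Eq. (3) and
    score_new(doc_words) for Eq. (6). Returns the chosen cluster id,
    or None to signal that a new cluster should be created.
    """
    best_id, best_p = None, score_new(doc_words)
    for cid, cf in clusters.items():
        p = score_existing(doc_words, cf)
        if p > best_p:
            best_id, best_p = cid, p
    return best_id

# Toy scores: prefer the cluster sharing the most words with the document.
clusters = {0: {"apple", "health"}, 1: {"iphone", "watch"}}
choice = assign_document(
    ["apple", "juice"],
    clusters,
    score_existing=lambda d, cf: len(set(d) & cf),
    score_new=lambda d: 0.5,
)
print(choice)   # cluster 0 shares 'apple'
```

If no existing cluster beats the new-cluster score, the caller creates a fresh CF set and assigns the document there, mirroring the CRP's empty-table choice.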
To deal with cluster evolution (i.e., evolving topics) in text streams, many existing approaches delete old clusters using some forgetting mechanism (e.g., a decay rate) (Zhong, 2005; Aggarwal and Yu, 2010; Islam et al., 2019). Instead of deleting old clusters, MStreamF (Yin et al., 2018) deletes old batches. In this study, we track the importance of each micro-cluster to handle the cluster evolution problem. Specifically, the importance of each micro-cluster decreases over time if it is not updated; $l_z$ in the CF set stores the weight of each cluster. If the weight becomes approximately equal to zero, the cluster is removed from the model, since it can no longer capture recent topics in the text stream. For this purpose, we apply the exponential decay function $l_z = l_z \times 2^{-\lambda \Delta t}$, where $\Delta t$ is the time elapsed since the last update and $\lambda$ is the decay rate. The decay rate must be adjusted depending on the application at hand. The initial value of $l_z$ (see Line 16 of Algorithm 1) is set to 1; afterward, the importance of the micro-cluster decreases exponentially over time. Deleted clusters can also be stored on disk for offline analysis.
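The decay function itself is a one-liner; the following sketch (with illustrative parameter values) shows how a cluster left untouched for several update intervals loses half its weight once $\lambda \cdot \Delta t$ accumulates to 1.

```python
def decayed_weight(weight, elapsed, lam):
    """Exponential forgetting: l_z = l_z * 2^(-lambda * dt)."""
    return weight * 2 ** (-lam * elapsed)

# A cluster untouched across five intervals of 100 time units with
# lambda = 2e-3 accumulates an exponent of -1, halving its weight.
w = 1.0
for _ in range(5):
    w = decayed_weight(w, elapsed=100, lam=2e-3)
print(round(w, 4))   # 0.5
```

Once the weight falls below a small epsilon, the cluster is pruned from the model (and may optionally be archived for offline analysis, as noted above).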
Complexity Analysis. The OSDM algorithm maintains on average $\bar{K}$ current topics (CF sets). Each CF set stores on average $\bar{V}$ words in $n_z^w$ and at most $|V_z| \times |V_z|$ entries in $cw_z$. Thus, the space complexity of OSDM is $O(\bar{K}(\bar{V}+\bar{V}^2) + VD)$, where $V$ is the size of the active vocabulary and $D$ is the number of active documents. On the other hand, OSDM calculates the probability of an arriving document for each cluster (see Line 6 of Algorithm 1). Therefore, the time complexity of OSDM is $O(\bar{K}L\bar{V})$, where $L$ is the average size of an arriving document.

Datasets and evaluation metrics
To evaluate the performance of the proposed algorithm, we conduct experiments on three real and two synthetic datasets, which were also used in (Yin and Wang, 2016a; Qiang et al., 2018; Yin et al., 2018; Jia et al., 2018; Chen et al., 2019) to evaluate short text clustering models. In the preprocessing step, we removed stop words, converted all text to lowercase, and applied stemming. The datasets are described as follows.
• News (Ns): This dataset was collected by (Yin and Wang, 2014) and contains 11,109 news titles belonging to 152 topics.
• Reuters (Rs): Similar to (Yin and Wang, 2016b), we skipped documents with more than one class, obtaining a dataset of 9,447 documents from 66 topics.
• News-T (Ns-T) and Reuters-T (Rs-T): Topics in social media naturally appear for a certain time period and then disappear, whereas the documents of each topic in the original datasets are observed over a long period. Therefore, to construct the synthetic datasets, we sorted the documents of the News and Reuters datasets by topic, divided each dataset into sixteen equal chunks, and shuffled the chunks.
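The sort-chunk-shuffle construction described above can be sketched as follows; this is a plausible reimplementation of the described procedure on toy data, not the authors' script, and all names are hypothetical.

```python
import random

def make_topic_bursty_stream(docs, labels, n_chunks=16, seed=0):
    """Sort documents by topic, split into equal chunks, shuffle chunk order.

    This mimics the News-T / Reuters-T construction: each topic appears
    for a bounded stretch of the stream and then disappears.
    """
    order = sorted(range(len(docs)), key=lambda i: labels[i])
    chunk_size = len(docs) // n_chunks
    chunks = [order[i * chunk_size:(i + 1) * chunk_size] for i in range(n_chunks)]
    chunks[-1].extend(order[n_chunks * chunk_size:])   # remainder into last chunk
    random.Random(seed).shuffle(chunks)
    return [i for chunk in chunks for i in chunk]      # document indices in stream order

labels = [i // 10 for i in range(160)]                 # 16 toy topics x 10 docs
docs = [f"doc-{i}" for i in range(160)]
stream = make_topic_bursty_stream(docs, labels, n_chunks=16)
print(len(stream))
```

With 16 topics of equal size, each shuffled chunk happens to hold exactly one topic, so every topic occupies one contiguous burst of the stream.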
We adopted five evaluation metrics for a deep analysis of all algorithms: Normalized Mutual Information (NMI), Homogeneity (Ho.), V-Measure (VM), Accuracy (Acc.), and cluster Purity (Pur.). We used the sklearn API to implement these metrics and compute the measures on the overall clustering results (Yin and Wang, 2014). Homogeneity measures whether each cluster contains only members of a single class, whereas V-measure calculates how well the criteria of completeness and homogeneity are jointly satisfied. Cluster purity measures the fraction of correctly grouped instances in each cluster, and NMI measures the overall clustering quality.
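Of these metrics, purity is simple enough to compute directly, as the sketch below shows on toy labels (NMI, homogeneity, and V-measure are available in sklearn as `normalized_mutual_info_score`, `homogeneity_score`, and `v_measure_score`, not shown here).

```python
from collections import Counter, defaultdict

def cluster_purity(true_labels, pred_clusters):
    """Purity: each cluster votes for its majority true class."""
    members = defaultdict(list)
    for y, z in zip(true_labels, pred_clusters):
        members[z].append(y)
    # Sum the size of the majority class in every cluster.
    correct = sum(Counter(ys).most_common(1)[0][1] for ys in members.values())
    return correct / len(true_labels)

true_labels = ["fruit", "fruit", "tech", "tech", "tech"]
pred_clusters = [0, 0, 0, 1, 1]
print(cluster_purity(true_labels, pred_clusters))   # majority votes: 2 + 2 = 4 of 5
```

Note that purity alone rewards over-splitting (singleton clusters are trivially pure), which is why it is reported alongside NMI and V-measure.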

Baselines
We selected four state-of-the-art representative stream text clustering algorithms to compare against OSDM (Os). A brief description of these algorithms is given as follows.
(1) DTM (Blei and Lafferty, 2006) is an extension of Latent Dirichlet Allocation that traces the evolution of hidden topics in a corpus over time. It was designed to deal with sequential documents.
(2) Sumblr (Sb) (Shou et al., 2013) is an online stream clustering algorithm for tweets. With only one pass, it clusters tweets efficiently while maintaining cluster statistics.
(3) DMM (Yin and Wang, 2014) is a Dirichlet multinomial mixture model for short text clustering, which does not consider temporal dependency of instances.
(4) MStreamF (Yin et al., 2018) is the latest model dealing with an infinite number of latent topics in short text while processing one batch at a time. Two variants of MStreamF were proposed: one with a one-pass clustering process and another with Gibbs sampling. We refer to the former as MStreamF-O (MF-O) and the latter as MStreamF-G (MF-G).

Comparison with state-of-the-art methods
In this section, we provide a detailed comparative analysis of OSDM with state-of-the-art algorithms.
The overall results are summarized in Table 1, reporting the NMI, homogeneity, V-measure, purity, and accuracy of each algorithm. Additionally, we evaluate the performance of each algorithm over different time stamps of the stream (see Figure 2), and we study the parameter sensitivity and runtime of OSDM. From Table 1, we can see that OSDM outperformed all baseline algorithms on almost every dataset in terms of all measures. MStreamF-G yielded better results on the Ns-T data in terms of NMI; the reason might be its multiple iterations over each batch of the stream. However, MStreamF-G requires more execution time to process the data, whereas our proposed algorithm OSDM processes the data only once. We can also observe that OSDM achieves the highest NMI on the other datasets. In addition, cluster similarity is crucially evaluated by the homogeneity measure, on which OSDM outperformed all previous algorithms; the same holds for the remaining measures except for the V-measure of DTM. Likewise, our model generates purer clusters. Furthermore, to investigate the performance over time, we plot the performance of all algorithms over time in Figure 2.

Sensitivity Analysis
We perform a sensitivity analysis of OSDM with respect to its three input parameters on the Tweets dataset: the concentration parameter α, the pseudo weight β, and the decay factor λ. Figure 3a shows the effect of α, which ranges from 9e-3 to 9e-1. The performance in terms of all evaluation measures is stable over the different parameter values; since α is responsible for finer clustering, we observe only a little fluctuation at the initial values. Figure 3b shows the performance for different values of β, ranging from 1e-4 to 1e-2. As described above, we modified the homogeneity part of the clustering model (see Equation (3)), and β is the related hyper-parameter. We observe that beyond a certain range, all evaluation measures become stable; notably, homogeneity remains stable over different values of β. Figure 3c shows the effect of λ, ranging from 9e-6 to 9e-4. Our model follows a forgetting mechanism based on the decay factor λ, and clusters are deleted from the model when their weight becomes approximately equal to zero. We can observe that the performance of OSDM is stable across the different decay factors.

Runtime
To compare the runtime of the different algorithms, we performed all experiments on a PC with a Core i5-3470 CPU and 8 GB of memory. Figure 4 shows the runtime of all algorithms on the Tweets dataset. Sumblr required the highest execution time to cluster the instances, whereas the runtimes of the other algorithms are comparable. MStreamF-O took the least time due to its simple per-instance processing, since it does not maintain semantic similarity. Comparatively, MStreamF-G required much more time than OSDM because it processes each batch of data multiple times. Due to its online nature, OSDM is more efficient than most existing algorithms, and this benefit grows with the number of arriving instances.

Conclusion
In this paper, we propose a new online semantic-enhanced Dirichlet model for short text stream clustering. In contrast to existing approaches, OSDM requires neither a specified batch size nor a predefined number of evolving clusters. Based on the Pólya urn scheme, it dynamically assigns each arriving document to an existing cluster or generates a new cluster. More importantly, OSDM incorporates semantic information into the proposed graphical model to alleviate the term ambiguity problem in short text clustering. Building upon semantic embedding and online learning, our method finds high-quality evolving clusters. Extensive results further demonstrate that OSDM performs better than many state-of-the-art algorithms.