Short Text Understanding by Leveraging Knowledge into Topic Model

In this paper, we investigate the challenging task of short text understanding (the STU task) by jointly considering topic modeling and knowledge incorporation. Knowledge incorporation can effectively alleviate the content sparsity problem in topic modeling. Specifically, we propose the phrase topic model to leverage auto-mined knowledge, i.e., key-phrases identified in the short texts themselves.


Introduction
The explosion of online text content, such as Twitter messages, text advertisements, QA community posts and product reviews, has given rise to the necessity of understanding these prevalent short texts.
Conventional topic models, such as PLSA (Hofmann, 1999) and LDA (Blei et al., 2003), are widely used for uncovering the hidden topics in a text corpus. However, the sparsity of content in short texts brings new challenges to topic modeling.
In fact, short texts usually do not contain sufficient statistical signals to support many state-of-the-art approaches to text processing, such as topic modeling (Hua et al., 2015). Knowledge is indispensable to the STU task, and knowledge-based topic models (Andrzejewski et al., 2009; Hu et al., 2011; Jagarlamudi et al., 2012; Mukherjee and Liu, 2012; Chen et al., 2013; Yan et al., 2013) have attracted increasing attention recently.

We consider that, in the STU task, the available knowledge can be divided into two classes: self-contained knowledge and external knowledge. Self-contained knowledge, which is the focus of this paper, is extracted from the short text itself, e.g., key-phrases. External knowledge is constructed without a specific purpose in mind, such as WordNet (Miller, 1995), KnowItAll (Etzioni et al., 2005), Wikipedia (Gabrilovich and Markovitch, 2007), Yago (Suchanek et al., 2007), NELL (Carlson et al., 2010) and Probase (Wu et al., 2012).
PLSA and LDA are typical unsupervised topic models, i.e., non-knowledgeable models. In contrast, the Biterm topic model (BTM) (Yan et al., 2013) leverages self-contained knowledge for semantic analysis. BTM learns topics over short texts by modeling the generation of biterms in the whole corpus, where a biterm is an unordered word pair co-occurring in a short context. BTM posits that the two words in a biterm share the same topic, drawn from a mixture of topics over the whole corpus. The major advantage of BTM is that it explicitly models word co-occurrences in the local context, which captures the short-range dependencies between words well.
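BTM's biterm construction can be sketched as follows; this is a minimal illustration rather than the authors' code, treating the whole short text as a single context window:

```python
from itertools import combinations

def extract_biterms(doc_tokens):
    """Extract unordered co-occurring word pairs (biterms) from a short text.

    For short texts, BTM treats the whole document as one context window,
    so every unordered pair of token positions yields a biterm; each pair
    is sorted so that (a, b) and (b, a) count as the same biterm.
    """
    return [tuple(sorted(pair)) for pair in combinations(doc_tokens, 2)]

# A three-word short text yields three biterms.
biterms = extract_biterms(["screen", "battery", "phone"])
```

In the full model, every biterm from every document is pooled into one corpus-level collection before topics are inferred, which is how BTM sidesteps per-document sparsity.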
External knowledge-based models such as DF-LDA (Andrzejewski et al., 2009) incorporate expert domain knowledge to help guide the models: the vast amount of lexical knowledge about words and their relationships, denoted as LR-sets, available in online dictionaries or other resources, can be exploited to generate more coherent topics.

Figure 1: The phrase topic model proposed in this paper.
However, for external knowledge-based models, the incorporated knowledge is often too general to be consistent with the short text in the semantic space. On the other hand, BTM, as a typical self-contained knowledge-based model, makes a rough assumption about the generated biterms: they are inundated with noise, since not every pair of terms in a short text shares the same topic. Based on the above analysis, we first identify key-phrases in short texts, which can be deemed self-contained knowledge, and then propose the phrase topic model (PTM), which constrains the terms within a key-phrase to share the same topic and samples topics for non-phrase terms from a mixture over the key-phrases' topics.

Model
A phrase is defined as a consecutive sequence of terms, or unigrams. In this paper, we focus on the self-contained knowledge in short texts, i.e., key-phrases. Key-phrase extraction is a fundamental component of our work. We use CRF++ to identify key-phrases in a short text. The training data is built manually, and the features include the word itself and the part of speech tagged by the Stanford Log-linear Part-Of-Speech Tagger (Toutanova et al., 2003). Sample identified key-phrases are shown in Table 2.
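The exact CRF++ feature template is not given in the paper; the sketch below only illustrates the kind of word and part-of-speech features described, with hypothetical feature names and context windows:

```python
def token_features(words, tags, i):
    """Features for position i in the spirit of a CRF++ unigram template:
    the current word and POS tag plus their immediate neighbours, with
    <BOS>/<EOS> padding at the sequence boundaries (names are illustrative)."""
    last = len(words) - 1
    return {
        "word": words[i],
        "pos": tags[i],
        "prev_word": words[i - 1] if i > 0 else "<BOS>",
        "prev_pos": tags[i - 1] if i > 0 else "<BOS>",
        "next_word": words[i + 1] if i < last else "<EOS>",
        "next_pos": tags[i + 1] if i < last else "<EOS>",
    }

words = ["warranty", "service", "is", "good"]
tags = ["NN", "NN", "VBZ", "JJ"]
features = [token_features(words, tags, i) for i in range(len(words))]
```

A CRF trained on such features with B/I/O labels over manually annotated short texts would then mark each key-phrase span.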
In this paper, our phrase topic model is proposed based on three assumptions:

• Key-phrases are the key points of interest in the short text and should be the focus.
• Terms constituting the same key-phrase share a common topic.
• A non-phrase term's topic assignment depends on that of the key-phrases in the same text.
Our assumptions are similar in spirit to those of other models (Gruber et al., 2007), in which, for example, each sentence is assumed to be assigned to one topic. However, that assumption is too coarse: in many cases, different words should be assigned different topics, even in a short text. Our model is more refined in distinguishing key-phrases from non-phrases. In addition, if two or more key-phrases exist in the same short text, they may well be assigned different topics.
The graphical representation of PTM is illustrated in Figure 1. α and β are hyper-parameters, which are tuned empirically. φ is a corpus-level parameter, while θ is a document-level parameter. The hidden variables consist of z_{m,n} and δ_{m,s}. The generative process of the phrase topic model is presented as follows.
From this process, we can see that the generation of key-phrases and non-phrase terms is distinguished, and a non-phrase term's generation is based on the topic assignments of the key-phrases in the same document.
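A schematic reading of this generative process, consistent with the three assumptions, can be sketched as follows; the Dirichlet/multinomial choices are assumed by analogy with LDA and are not the paper's exact specification:

```python
import random

def generate_doc(phrase_lens, n_nonphrase, theta, phi, rng=random):
    """Schematic PTM-style generative process: each key-phrase draws one
    topic z from the document's topic distribution theta, and all of its
    terms share that topic; each non-phrase term then draws its topic
    delta from the mixture of topics already assigned to the key-phrases.
    Returns (doc, z): doc is a list of (word_id, topic) pairs."""
    K, V = len(theta), len(phi[0])
    doc, z = [], []
    for length in phrase_lens:
        k = rng.choices(range(K), weights=theta)[0]  # z_{m,n}
        z.append(k)
        for _ in range(length):  # all terms of the phrase share topic k
            doc.append((rng.choices(range(V), weights=phi[k])[0], k))
    for _ in range(n_nonphrase):
        k = rng.choice(z)  # delta_{m,s}: mixture over key-phrase topics
        doc.append((rng.choices(range(V), weights=phi[k])[0], k))
    return doc, z
```

Note how a non-phrase term can only receive a topic that some key-phrase in the same document already carries, which is the dependence stated in the third assumption.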

Inference By Gibbs Sampling
As with LDA, collapsed Gibbs sampling (Griffiths and Steyvers, 2004) can be utilized to perform approximate inference. In our model, the hidden variables are the key-phrases' topic assignments z and the non-phrase words' topic assignments δ. To perform Gibbs sampling, we first randomly initialize the hidden variables, then sample the topic assignments from the conditional distributions p(z_{m,n} = k | z_{¬(m,n)}, w, o, δ) and p(δ_{m,s} = k | z, w, o, δ_{¬(m,s)}).
We can derive the conditional probability for z_{m,n} following Equation 1, where n^k_{m,¬(m,n)} denotes the number of key-phrases whose topic assignment is k in document m, excluding key-phrase (m,n), and similarly for n^{k'}_{m,¬(m,n)}; n^{w_{m,n,l}}_{k,¬(m,n)} denotes the number of times key-phrase term w_{m,n,l} is assigned to topic k, excluding key-phrase (m,n), and similarly for n^w_{k,¬(m,n)}; and n^{o_{m,s}}_{k,¬m} denotes the number of times non-phrase term o_{m,s} is assigned to topic k, excluding document m, and similarly for n^w_{k,¬m}. Similarly, we can derive the conditional probability for δ_{m,s} following Equation 2, where n^{o_{m,s}}_{k,¬(m,s)} denotes the number of times non-phrase term o_{m,s} is assigned to topic k, excluding non-phrase term (m,s), and similarly for n^w_{k,¬(m,s)}; L_m denotes the number of topics assigned to key-phrases in document m.
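A conditional of this kind could be computed from such counts as sketched below; this is an analogy with sentence-level LDA samplers, not the paper's exact Equation 1, and the count structure is an assumption:

```python
def phrase_topic_conditional(phrase_words, m, counts, alpha, beta, V, K):
    """Unnormalised scores p(z = k | rest) for one key-phrase, assuming its
    current assignment has already been removed from the counts.

    counts["doc_topic"][m][k]  : key-phrases in document m assigned topic k
    counts["topic_word"][k][w] : times word w is assigned to topic k
    counts["topic"][k]         : total words assigned to topic k
    """
    scores = []
    for k in range(K):
        score = counts["doc_topic"][m][k] + alpha
        seen = {}  # words of this phrase already placed into topic k
        for l, w in enumerate(phrase_words):
            score *= (counts["topic_word"][k].get(w, 0) + seen.get(w, 0) + beta) / (
                counts["topic"][k] + l + V * beta)
            seen[w] = seen.get(w, 0) + 1
        scores.append(score)
    return scores
```

The product runs over every term of the phrase because all its terms must land in the same topic; normalising the scores and sampling from them gives the new assignment.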
Finally, we can easily estimate the topic distribution θ_{m,k} and the topic-word distribution φ_{k,w} following Equations 3 and 4.
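Equations 3 and 4 are not reproduced here; by analogy with the standard collapsed-Gibbs estimators for LDA, they presumably take the following form, where n_m^k counts the key-phrases in document m assigned topic k, n_k^w counts the assignments of word w to topic k, K is the number of topics and V the vocabulary size:

```latex
\theta_{m,k} = \frac{n_m^{k} + \alpha}{\sum_{k'=1}^{K} n_m^{k'} + K\alpha},
\qquad
\phi_{k,w} = \frac{n_k^{w} + \beta}{\sum_{w'=1}^{V} n_k^{w'} + V\beta}
```

Both are posterior mean estimates: the raw counts smoothed by the Dirichlet hyper-parameters α and β.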

Experiments and Results
The online reviews dataset (Chen et al., 2013), which covers four domains, is utilized to evaluate our model; each domain collection contains 500 reviews, with an average review length of 20.42. The statistics of each domain are presented in Table 1. It is worth noting that the phrases are auto-identified by the key-phrase extraction method, and the word counts cover the distinct words in those identified key-phrases.

We assume each domain has its own topic model, since the semantic spaces of different domains are quite different, so we train the proposed topic model separately on each domain. The number of topics is usually determined by experience; since each domain collection contains 500 reviews, we consider a number of topics ranging from 2 to 20 appropriate, and these reviews are sufficient to train a topic model.

We compare our model with four baselines: the non-knowledgeable model LDA, the self-contained knowledge-based model BTM, and the external knowledge-based models GK-LDA (Chen et al., 2013) and DF-LDA (Andrzejewski et al., 2009). The identified key-phrases are used as must-links in DF-LDA and as LR-sets in GK-LDA, which ensures that the knowledge incorporated into the different models is equal.

Table 2 illustrates the auto-identified phrases from the cellphone dataset. From this result, we can see that the key-phrase extraction method can efficiently identify most phrases. More than one phrase, for example warranty service and android phone, may appear in a single sentence, and their topic assignments may well differ. Our proposed phrase topic model (PTM) handles this case well, and is more refined than the assumption that all words within a sentence share one topic. Our phrase topic model assumes that a non-phrase term's topic assignment depends on that of the key-phrases in the same text. This assumption is clearly supported by Table 2: for example, Nokia N97 mini is semantically dependent on USB charge cable, as are company and warranty service.
For all models, posterior inference was drawn after 1,000 Gibbs iterations with an initial burn-in of 800 iterations, and the hyper-parameters were set to α = 2 and β = 0.5.
The evaluation results on the Topic Coherence Metric are presented in Figures 2 and 3. These figures indicate that our model and BTM achieve higher topic coherence scores than GK-LDA and DF-LDA, which means the self-contained knowledge and the mechanism of knowledge incorporation are effective for topic modeling. LDA's performance is acceptable but not stable. Our model performs better than BTM, probably because of BTM's rough assumption about the generated biterms. From the above analysis, we can see that our proposed model achieves the best performance.
T-test results show that the performance improvement of our model over the baselines is statistically significant on the Topic Coherence Metric; all p-values are less than 0.00001. Figure 4 presents the fluctuation of topic coherence when tuning the hyper-parameters α and β. The performance fluctuates within a limited range as we vary α and β: the topic coherence fluctuates between −550 and −950, except on the food dataset, which fluctuates over a smaller range. Table 3 shows example topics for each domain, with inconsistent words highlighted in red. From these results, we can see that the number of errors made by the phrase topic model (PTM) is significantly smaller than for LDA, which indicates that our proposed topic model is more suitable than LDA for short texts.

Conclusions and Future Work
In this paper, we present a topic model that addresses the STU task starting from key-phrases. The terms in each key-phrase identified in a short text are assumed to share a common topic, and the key-phrases are assumed to be the central focus in the generative process of documents. In future work, self-contained knowledge, such as the identified key-phrases, and external knowledge bases should be integrated to jointly guide topic modeling.