Siamese Network-Based Supervised Topic Modeling

Label-specific topics are widely useful for supporting personality psychology, aspect-level sentiment analysis, and cross-domain sentiment classification. To generate label-specific topics, several supervised topic models that adopt likelihood-driven objective functions have been proposed. However, it is hard for them to obtain precise estimates for both topic discovery and supervised learning. In this study, we propose a supervised topic model based on the Siamese network, which can trade off label-specific word distributions against document-specific label distributions in a unified framework. Experiments on real-world datasets validate that our model performs competitively in topic discovery, both quantitatively and qualitatively. Furthermore, the proposed model can effectively predict categorical or real-valued labels for new documents by generating word embeddings from a label-specific topical space.


Introduction
As one of the most widely used text mining techniques, topic modeling can extract meaningful descriptions (i.e., topics) from a corpus (Blei, 2012). Most previous topic models, such as probabilistic Latent Semantic Analysis (pLSA) (Hofmann, 1999) and Latent Dirichlet Allocation (LDA) (Blei et al., 2001), are unsupervised. In unsupervised topic models, each document is defined as a mixture distribution over topics and each topic is represented as a mixture distribution over words. Unsupervised topic models only exploit words in documents and do not incorporate the guidance of labels into learning processes. Therefore, these models fail to discover label-specific topics, which are important to support personality psychology (Weiner and Graham, 1990), aspect-level sentiment analysis (Liu, 2012), and cross-domain sentiment classification (He et al., 2011). For example, label-specific topics generated from sentimental texts can help to find attributions and causes for different sentiments by associating sentiments with real-world topics/events.
In light of this consideration, several supervised topic models have been proposed to generate label-specific topics. One of the most representative models is the supervised Latent Dirichlet Allocation (sLDA) (Blei and McAuliffe, 2007), which restricts each document to being associated with one real-valued response variable. To deal with categorical labels, multi-class sLDA (sLDAc) (Wang et al., 2009) and Labeled Latent Dirichlet Allocation (L-LDA) (Ramage et al., 2009) were proposed, but they are only applicable to classification. Recently, a supervised Neural Topic Model (sNTM) (Cao et al., 2015) was developed to tackle supervised tasks of both classification and regression. As a hybrid method, sNTM is in essence a neural network that follows the document-topic distribution in topic models. Unfortunately, the label information has little effect on topic generation, since sNTM models documents and labels separately.
The above limitation motivates us to develop a supervised topic model which can jointly model documents and labels. Particularly, we propose a Siamese Labeled Topic Model (SLTM) to exploit the information of documents and labels based on the Siamese network (Bromley et al., 1993; Hu et al., 2014; Wang and Zhang, 2017), where weight matrices in SLTM represent conditional distributions. Therefore, by constraining weight matrices during the learning procedure, SLTM can strictly follow the probabilistic characteristics of topic models. Compared to previous supervised topic models, the main advantages of our SLTM are summarized as follows. First, SLTM can generate more coherent label-specific topics than other models, because the supervision of labels is incorporated into topic modeling. In contrast, the mapping of topics to labels is unconstrained in most existing supervised topic models, which causes many coherent topics to be generated outside the labels. Second, strengths of neural networks are incorporated into SLTM to bootstrap its inference power on label prediction. Third, each word can be mapped to a topical embedding space and represented by a word embedding after generating label-specific topics.
To validate the effectiveness of the proposed model, we evaluate it on two real-world datasets in text mining. Experimental results indicate that our method is able to discover more coherent and label-specific topics than baseline models. Moreover, word embeddings learned by the proposed model can be used to predict labels for new documents effectively.
The remainder of this paper is organized as follows. We summarize related studies on supervised topic modeling in Section 2. For convenience of describing our model, we present the neural network view of topic models in Section 3. Then, we detail the proposed SLTM in Section 4. Experimental design and analysis of results are shown in Section 5. Finally, we present conclusions and future work in Section 6.

Related Work
Topic models, which focus on discovering unobserved class variables named "topics" statistically, have been widely used in text mining. One of the early topic models is pLSA (Hofmann, 1999). In pLSA, a document's word vector was decomposed into a mixture of topics, and a topic was represented as a probability distribution over words. LDA (Blei et al., 2001) extended pLSA by adding Dirichlet priors for a document's multinomial distribution over topics and a topic's multinomial distribution over words, which makes it suitable to generate topics for unseen documents.
The aforementioned models are unsupervised, so it may be computationally costly to perform task-specific transformations when extra labeling information is available (Cao et al., 2015). To address this issue, several supervised topic models have been proposed to introduce the label guidance in learning processes. One of the most widely used supervised topic models is sLDA (Blei and McAuliffe, 2007). In sLDA, each document was paired with a response variable which obeys the Gaussian distribution. By extending sLDA, BP-sLDA (Chen et al., 2015) applied back propagation over a deep architecture in conjunction with stochastic gradient/mirror descent for model parameter estimation, leading to scalable and end-to-end discriminative learning characteristics. Based on sLDA, multi-class sLDA (sLDAc) (Wang et al., 2009) was proposed to model documents with categorical labels by adding a softmax classifier, rather than the linear regression in sLDA, to a standard LDA. Another method of tackling corpora with discrete labels is L-LDA (Ramage et al., 2009), which associated each label with only one topic. To improve the performance of L-LDA in the classification task, Dependency-LDA (Dep-LDA) (Rubin et al., 2012) incorporated an extra topic model to capture the dependencies between labels and took the label dependencies into consideration when estimating topic distributions. Recently, a nonparametric supervised topic model (Li et al., 2018) was proposed to predict the response of interest (e.g., product ratings and sales). The limitation of the above models is that they are applicable to either discrete or continuous data, but not both.
In this paper, we propose a Siamese network-based supervised topic model named SLTM. The most relevant work to SLTM is the supervised Neural Topic Model (sNTM) for both classification and regression tasks (Cao et al., 2015), which constructed two hidden layers to generate the n-gram topic and document-topic representations. However, unlike our SLTM, which uses the bag-of-words representation, sNTM adopted fixed embeddings trained on external resources (Mikolov et al., 2013). Thus, sNTM cannot learn data-specific topics. Furthermore, it is hard for sNTM to follow the probabilistic characteristics of the topic-word distribution in topic models, because a topic generated by sNTM is composed of an infinite number of n-grams. Finally, sNTM modeled documents and labels separately, rather than in a unified manner as in our SLTM.

Preliminaries
For convenience of describing the proposed model, we use hollow uppercase letters (e.g., $\mathbb{D}$) to represent collections, bold uppercase letters (e.g., $\mathbf{W}_1$) to represent matrices, bold lowercase letters (e.g., $\mathbf{y}_i$) to represent vectors, regular uppercase letters (e.g., $M$) to represent scalar constants, and regular lowercase letters (e.g., $v_j$) to represent scalar variables. Based on the above convention, frequently used notations are shown in Table 1. Given a document $d_i$ with labels $\mathbf{y}_i$, our goal is to discover topics with a neural network framework. Therefore, we first briefly describe the neural network view of topic models.
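Topic modeling is a popular latent variable inference method for co-occurrence data, which associates unobserved classes with observations $d_i$ and $v_j$, where $v_j$ is a word in $d_i$. The conditional probability $p(v_j \mid d_i)$ is defined as:

$$p(v_j \mid d_i) = \sum_{k=1}^{K} p(v_j \mid z_k)\, p(z_k \mid d_i), \tag{1}$$

where $z_k$ is the $k$-th of $K$ topics. Equation 1 can be represented in the following vector form:

$$p(v_j \mid d_i) = p(\mathbb{Z} \mid d_i) \times p(v_j \mid \mathbb{Z}). \tag{2}$$

We represent horizontal stack by commas and vertical stack by semicolons, thus $p(\mathbb{Z} \mid d_i) = [p(z_1 \mid d_i), \ldots, p(z_K \mid d_i)]$ and $p(v_j \mid \mathbb{Z}) = [p(v_j \mid z_1); \ldots; p(v_j \mid z_K)]$. Then, the vector form in Equation 2 can be extended to:

$$p(\mathbb{V} \mid \mathbb{D}) = p(\mathbb{Z} \mid \mathbb{D}) \times p(\mathbb{V} \mid \mathbb{Z}). \tag{3}$$

With Equation 3, topic models can be viewed as neural networks, where $\mathbb{D}$ and $\mathbb{V}$ are input sets, $p(\mathbb{V} \mid \mathbb{D})$ is the output set, and $\mathbf{W}_1$ and $\mathbf{W}_2$ are parameter matrices of the neural network, which hold the document-topic and topic-word distributions, respectively.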
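The neural network view can be made concrete with a small numerical sketch. The pure-Python example below assumes a row-vector convention in which W1 stores one topic distribution per document and W2 stores one word distribution per topic. The toy sizes, the probability values, and the helper names `matmul` and `one_hot` are illustrative assumptions, not taken from the paper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def one_hot(index, size):
    """Indicator (one-hot) row vector, returned as a 1 x size matrix."""
    return [[1.0 if i == index else 0.0 for i in range(size)]]


# Toy corpus: 2 documents, 2 topics, 3 words (all values are made up).
W1 = [[0.9, 0.1],        # p(Z | d_1)
      [0.2, 0.8]]        # p(Z | d_2)
W2 = [[0.7, 0.2, 0.1],   # p(V | z_1)
      [0.1, 0.3, 0.6]]   # p(V | z_2)

# Matrix form: every p(v_j | d_i) at once.
p_V_given_D = matmul(W1, W2)

# Network form: feed the indicator vector of d_1 through both layers.
d1 = one_hot(0, 2)
p_V_given_d1 = matmul(matmul(d1, W1), W2)[0]
```

Because each row of W1 and W2 sums to one, each row of the product is again a probability distribution over words, which is why constraining the weight matrices is enough to keep the network output interpretable as p(V|D).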

Siamese Labeled Topic Model
Similar to generative models such as pLSA, we propose a Siamese Labeled Topic Model (SLTM) based on the aforementioned neural network perspective of topic models. Figure 1 illustrates the framework of generating each word in SLTM, and the process is as follows. For a document $d_i$ in $\mathbb{D}$, the topic distribution $p(\mathbb{Z} \mid d_i)$ is estimated from the indicator vector $\mathbf{d}_i$ of $d_i$ (Yang et al., 2013), i.e., the vector whose $i$-th entry is 1 and whose other entries are 0 (Equation 4). The labels of $d_i$, denoted by $\mathbf{y}_i$, are generated from the topic distribution of $d_i$ (Equation 5), where $\mathbf{z}_k$ is the indicator vector of topic $z_k$. Therefore, words can be generated from $d_i$ (Equation 6). The architecture of SLTM from the perspective of neural networks is shown in Figure 2.
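One plausible way to write the three generation steps is given below. It is a hedged reconstruction from the surrounding definitions, not necessarily the authors' exact notation. It assumes a row-vector convention, $K$ topics, and that $\mathbf{W}_1$, $\mathbf{W}_2$, and $\mathbf{W}_3$ hold the document-topic, topic-word, and topic-label weights, respectively. The way the topic indicator vector $\mathbf{z}_k$ enters the label and word steps is also an assumption.

```latex
\begin{align}
p(\mathbb{Z} \mid d_i) &= \mathbf{d}_i \mathbf{W}_1, \\
\hat{\mathbf{y}}_i &= \sum_{k=1}^{K} p(z_k \mid d_i)\, \mathbf{z}_k \mathbf{W}_3, \\
p(\mathbb{V} \mid d_i) &= \sum_{k=1}^{K} p(z_k \mid d_i)\, \mathbf{z}_k \mathbf{W}_2
                        = \mathbf{d}_i \mathbf{W}_1 \mathbf{W}_2 .
\end{align}
```

Under these assumptions, $\mathbf{z}_k \mathbf{W}_2$ selects the word distribution of topic $z_k$, so the last line reduces to the mixture form of pLSA.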
With respect to model optimization, we adopt the contrastive objective function used in previous works (Socher et al., 2014; Cui et al., 2014; Cao et al., 2015; He et al., 2017). For document $d_i$ and every word $v_j$ in $d_i$, we randomly sample a document from the document set $\mathbb{D}$ which does not contain $v_j$ as a negative sample document. The negative sample document is denoted by $d_i^{(v_j-)}$ and has labels $\mathbf{y}_i^{(v_j-)}$. As shown in Figure 1, the lower sub-network, which takes $d_i^{(v_j-)}$ as input, has the same architecture as the upper sub-network, which takes $d_i$ as input. Because the document-topic distribution and the topic-word distribution of a corpus are fixed, $\mathbf{W}_1$, $\mathbf{W}_2$, and $\mathbf{W}_3$ are shared between the two sub-networks of our model. These two sub-networks are twin networks, and thus the proposed model is essentially a Siamese network. Our objective is to make word $v_j$ be learned by topics in document $d_i$, but not by topics in the negative sample document $d_i^{(v_j-)}$. Therefore, we only take word $v_j$ in $\mathbb{V}$ into consideration during the learning procedure, which can be implemented by taking the dot product of $p(\mathbb{V} \mid d_i)$ and the indicator vector of $v_j$ (i.e., $\mathbf{v}_j$): $\hat{p}(v_j \mid d_i) = p(\mathbb{V} \mid d_i) \cdot \mathbf{v}_j$. Particularly, the objective is to make the predicted conditional probability $\hat{p}(v_j \mid d_i)$ approach the observed conditional probability $p(v_j \mid d_i)$ (i.e., the term frequency of word $v_j$ in document $d_i$), while making the conditional probability $\hat{p}(v_j \mid d_i^{(v_j-)})$ approach zero. Thus, a loss function $loss(d_i, d_i^{(v_j-)})$ is defined over the predicted and observed conditional probabilities. For labels, we use another loss function $loss(\mathbf{y}_i, \mathbf{y}_i^{(v_j-)})$, which is defined in terms of $loss(\mathbf{y}_i)$ and $loss(\mathbf{y}_i^{(v_j-)})$. The forms of $loss(\mathbf{y}_i)$ and $loss(\mathbf{y}_i^{(v_j-)})$ depend on the property of labels: for categorical and real-valued labels, the cross-entropy (Tang et al., 2014) and the mean absolute error (Willmott and Matsuura, 2005) are adopted, respectively.
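The negative sampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions: documents are plain token lists, sampling is uniform over the eligible documents, and the function names `sample_negative_document` and `build_training_pairs` are hypothetical rather than part of the authors' implementation.

```python
import random


def sample_negative_document(docs, word, rng):
    """Return the index of a random document that does not contain `word`.

    `docs` is a list of tokenized documents (lists of words).
    Returns None when every document contains the word.
    """
    candidates = [i for i, doc in enumerate(docs) if word not in doc]
    return rng.choice(candidates) if candidates else None


def build_training_pairs(docs, seed=0):
    """Pair each (document, word) with the index of a negative document."""
    rng = random.Random(seed)
    pairs = []
    for i, doc in enumerate(docs):
        for word in sorted(set(doc)):
            neg = sample_negative_document(docs, word, rng)
            if neg is not None:
                pairs.append((i, word, neg))
    return pairs


# Toy corpus (made up for illustration).
docs = [["happy", "birthday", "party"],
        ["angry", "traffic", "jam"],
        ["happy", "holiday"]]
pairs = build_training_pairs(docs)
```

Each triple feeds the two sub-networks: the first index goes to the upper sub-network, the last index to the lower one, and the word selects which output unit enters the loss.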
Maximizing the weighted sum of conditional likelihoods is equivalent to minimizing the weighted sum of loss functions, and these two loss functions are weighted by a hyper-parameter $\alpha$ as in (Tang et al., 2014). Thus, the loss function of SLTM is:

$$loss(SLTM) = \alpha \cdot loss(d_i, d_i^{(v_j-)}) + (1 - \alpha) \cdot loss(\mathbf{y}_i, \mathbf{y}_i^{(v_j-)}). \tag{8}$$

The effect of $\alpha$ on predicting labels and discovering topics will be investigated in Section 5.6. Based on $loss(SLTM)$, three kinds of weights, i.e., $\mathbf{W}_1$, $\mathbf{W}_2$, and $\mathbf{W}_3$, can be updated together by a vanilla back propagation (BP) algorithm with the early stopping criterion (Bengio, 2012). The training algorithm is shown in Algorithm 1.
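The weighting of the two loss terms can be illustrated with a short sketch. The squared-error form of the word term is an assumption made only for this example, since the exact form is not reproduced here. The label term follows the text in summing a loss for the document and a loss for its negative sample, using cross-entropy for categorical labels and mean absolute error for real-valued ones. All function names and numeric values are hypothetical.

```python
import math


def word_loss(p_pos, tf_pos, p_neg):
    """Contrastive word term (assumed squared error): the positive prediction
    should match the term frequency and the negative one should be zero."""
    return (p_pos - tf_pos) ** 2 + p_neg ** 2


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, y_pred))


def mean_absolute_error(y_true, y_pred):
    """Mean absolute error over real-valued label vectors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def sltm_loss(word_term, label_term, alpha=0.5):
    """Weighted sum: alpha on the word term, (1 - alpha) on the label term."""
    return alpha * word_term + (1.0 - alpha) * label_term


# Toy values (made up): one document and its negative sample.
l_word = word_loss(p_pos=0.20, tf_pos=0.25, p_neg=0.10)
l_label = (cross_entropy([1.0, 0.0], [0.8, 0.2])
           + cross_entropy([0.0, 1.0], [0.3, 0.7]))
total = sltm_loss(l_word, l_label, alpha=0.5)
```

With alpha set to 1 the label term vanishes, which matches the later observation that this setting ignores the label information entirely.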
After training, we obtain both document-topic and topic-word distributions. Then, each word can be mapped to a topic-level embedding space and represented as a word embedding. For instance, the word embedding of $v_j$ is generated from the topic-word distribution $\mathbf{W}_2$ (Equation 9). The generated word embeddings can be used for specific applications, such as label prediction. Particularly, we first represent a new document $d_n$ by its document embedding $e(d_n)$, where $e(d_n)$ is the sum of the word embeddings of all words in $d_n$. Then, the predicted labels $\hat{\mathbf{y}}_n$ of document $d_n$ are estimated from $e(d_n)$ (Equation 10) using $\mathbf{W}_4$, which denotes the weights of each topic contributing to labels, and $f(\cdot)$, the activation function, which depends on the type of labels. For categorical and normalized real-valued labels, we adopt softmax and sigmoid as activation functions, respectively. Note that we do not predict labels for new documents based on $\mathbf{W}_3$ directly, because the topic distributions of these documents can only be learned without the supervision of labels, i.e., new documents' topic distributions may be inconsistent with $\mathbf{W}_3$. Finally, we update $\mathbf{W}_4$ and word embeddings by RMSprop (Tieleman and Hinton, 2012) for label prediction.
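The prediction step for a new document can be sketched as below. The sketch assumes that the prediction has the form f(e(d_n) W4) without a bias term, that word embeddings have one entry per topic, and that unknown words contribute nothing. The embedding values, the weights, and the function names are hypothetical.

```python
import math


def document_embedding(doc, word_emb, dim):
    """Sum the embeddings of all known words in the document."""
    e = [0.0] * dim
    for w in doc:
        for t, value in enumerate(word_emb.get(w, [0.0] * dim)):
            e[t] += value
    return e


def linear(e, W4):
    """Row vector e (length K) times weight matrix W4 (K x L)."""
    return [sum(e[k] * W4[k][l] for k in range(len(e)))
            for l in range(len(W4[0]))]


def softmax(scores):
    """Numerically stable softmax for categorical labels."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def sigmoid(score):
    """Logistic function for a normalized real-valued label."""
    return 1.0 / (1.0 + math.exp(-score))


# Toy topical word embeddings with K = 2 topics (values are made up).
word_emb = {"happy": [0.8, 0.1], "party": [0.6, 0.2], "angry": [0.1, 0.9]}
W4_cls = [[2.0, -1.0],    # contribution of topic 1 to each of 2 labels
          [-1.0, 2.0]]    # contribution of topic 2 to each of 2 labels
W4_reg = [[1.5], [-1.5]]  # single real-valued output

e = document_embedding(["happy", "party"], word_emb, dim=2)
probs = softmax(linear(e, W4_cls))        # categorical labels
strength = sigmoid(linear(e, W4_reg)[0])  # normalized real-valued label
```

In training, W4 and the embeddings would be updated by an optimizer such as RMSprop; the sketch only covers the forward computation.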

Experiments
In this section, we first describe the datasets and the experimental settings. Second, we investigate the quality of generated topics by the topic coherence score and qualitative analysis. Third, the quality of generated word embeddings is evaluated by label prediction and word similarity. Finally, the effect of the hyper-parameter α is evaluated on the coherence of topics and on label prediction.

Datasets and Setting
To evaluate the effectiveness of our method comprehensively, we conduct experiments on two real-world datasets with categorical and real-valued labels, respectively. The first corpus, named ISEAR, contains a collection of 7,666 sentences, and each item is manually tagged with a categorical label over 7 emotions (Scherer and Wallbott, 1994). The second dataset, YouTube (http://sentistrength.wlv.ac.uk/), is often used for sentiment strength detection; it contains 3,407 comments on videos, and each item is labeled with a real value between 0.1 (i.e., very negative sentiment) and 0.9 (i.e., very positive sentiment). These two datasets are selected because their average numbers of words are similar. After removing stop words, the mean numbers of words in each document are 8.53 and 8.56 for ISEAR and YouTube, respectively. Besides, it is appropriate to evaluate the model performance on predicting emotions and sentiment strengths, because topics play an important role in understanding sentences or user comments (Liu, 2012). Since the proposed SLTM is suitable for both topic discovery and classification/regression tasks, we employ five kinds of baselines for comparison.
The first kind comprises the support vector machine (SVM), an efficient deep learning model for classification (i.e., fastText), and the following supervised topic models which are confined to categorical labels:
• sLDAc (Wang et al., 2009): it models documents with categorical labels by adding a softmax classifier to a standard LDA.
• L-LDA (Ramage et al., 2009): it is a supervised model which associates labels with topics by one-to-one correspondence. Accordingly, the number of topics in L-LDA must equal the size of the label set.
• Dep-LDA (Rubin et al., 2012): it extends L-LDA by introducing a multinomial distribution over labels and capturing the dependencies between labels. Then, the label dependencies are used to sample topic distributions in supervised learning.
The second kind comprises the support vector regression (SVR), a state-of-the-art deep learning model for sentiment strength detection (i.e., HCNN) (Chen et al., 2017), and the following supervised topic models which are developed for predicting real-valued labels only:
• sLDA (Blei and McAuliffe, 2007): it is a classical supervised topic model in which each document is paired with a response variable, and the variable is defined as a Gaussian distribution with a mean value that is computed by a linear regression of topics.
• BP-sLDA (Chen et al., 2015): it applies back propagation over a deep architecture together with stochastic gradient/mirror descent for model parameter estimation.
The third kind is a supervised n-gram model named sNTM, which is applicable to predicting both categorical and real-valued labels for new documents (Cao et al., 2015). In sNTM, each n-gram is represented by a 300-dimensional embedding vector using the available tool word2vec. Following (Cao et al., 2015), a large-scale Google News dataset with around 100 billion words is adopted for training. For topic discovery, two unsupervised topic models, pLSA (Hofmann, 1999) and LDA (Blei et al., 2001), are used as the fourth kind of baselines. Finally, we adopt two hybrid methods that combine LDA with supervised learning algorithms as baselines. In particular, a softmax classifier and a linear regression (LR) are used to predict categorical and real-valued labels for documents, respectively. Unless otherwise specified, we set α to 0.5 and adopt stochastic gradient descent with a batch size of 100 for training SLTM.

Coherence Score of Topics
To investigate the quality of topics discovered by SLTM quantitatively, we use the topic coherence score based on the normalised pointwise mutual information (Lau et al., 2014) as the evaluation metric. Intuitively, a larger topic coherence score indicates better topic quality. All unsupervised topic models (i.e., pLSA and LDA) and supervised methods which associate one label with multiple topics (i.e., sLDAc, sLDA, BP-sLDA, and sNTM) are adopted for comparison. Although L-LDA and Dep-LDA can identify label-specific topics on ISEAR, their one-to-one mapping of labels and topics makes them unsuitable for this evaluation. Particularly, L-LDA and Dep-LDA constrain each topic to words in documents with the same label, so their coherence scores would be estimated from a subset of the corpus only. In contrast, the quality of topics is evaluated on the whole corpus for SLTM and the other baseline models. The average coherence scores of topics generated by different models on ISEAR and YouTube are respectively shown in Table 2 and Table 3, where the number of topics is 20, the number of top words T is set to 5, 10, and 15, and the best scores are highlighted in boldface. The results indicate that SLTM can discover more coherent topics than both unsupervised topic models and supervised methods, except for T = 5 on YouTube. It is also interesting to observe that the supervised baseline models (i.e., sLDAc, sLDA, BP-sLDA, and sNTM) perform worse than pLSA and LDA in most cases, which validates that it is challenging to trade off label-specific word distributions against document-specific label distributions (Ramage et al., 2009).
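An NPMI-based coherence score for one topic can be sketched as follows. The sketch assumes document-level co-occurrence counts and assigns -1 to word pairs that never co-occur; the reference corpus, the co-occurrence window, and the smoothing used in the actual evaluation may differ. The function names and the toy corpus are hypothetical.

```python
import math
from itertools import combinations


def npmi(w1, w2, docs):
    """NPMI of two words from document-level co-occurrence (an assumption).

    `docs` is a list of sets of words.
    """
    n = len(docs)
    p1 = sum(1 for d in docs if w1 in d) / n
    p2 = sum(1 for d in docs if w2 in d) / n
    p12 = sum(1 for d in docs if w1 in d and w2 in d) / n
    if p12 == 0.0:
        return -1.0  # the words never co-occur
    if p12 == 1.0:
        return 1.0   # the words co-occur in every document
    return math.log(p12 / (p1 * p2)) / (-math.log(p12))


def topic_coherence(top_words, docs):
    """Average NPMI over all pairs of a topic's top-T words."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, docs) for a, b in pairs) / len(pairs)


# Toy reference corpus (made up for illustration).
docs = [{"happy", "joy"}, {"happy", "joy"}, {"sad", "cry"}, {"sad", "cry"}]
coherent = topic_coherence(["happy", "joy"], docs)
mixed = topic_coherence(["happy", "joy", "sad"], docs)
```

Averaging this score over all topics of a model gives one number per value of T, which is the kind of quantity reported per model and per T in the coherence tables.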

Qualitative Analysis on Topics
In this part, we conduct a qualitative inspection of 20 topics generated by SLTM. The ISEAR dataset, which contains multiple labels, is used for illustration, since it is inappropriate to present the results on YouTube with a single real-valued label. For each model that is applicable to ISEAR, we show the top 5 words of the generated label-specific topics in Table 4. It is worth noting that L-LDA and Dep-LDA produce the same top words, since their difference only exists in the process of label prediction. The results indicate that although all models can learn meaningful topics, SLTM performs better than baseline models in label-specific topic discovery. For example, the two words "happy" and "joy", which are strongly related to the label "joy", are identified by SLTM with large probabilities. Similar results can be observed for other labels; thus, topics discovered by our model are easier to understand than those of other models. Such a performance enhancement is valuable to many real-world applications, e.g., personality education and psychotherapy, by producing human-interpretable topics/events that evoke users' particular emotions.
For completeness, we also examine all topics generated by the sNTM baseline. As mentioned earlier, sNTM is based on n-grams, whereas SLTM and the other baseline models are based on single words. In the practical implementation, only unigrams and bigrams are considered, since the embedding representation becomes less precise as n increases (Cao et al., 2015). The results indicate that sNTM can generate some topic bigrams such as "smelled disgusting" and "graduation exams", which are more appropriate for expressing a topic. However, only three topics were found by manual examination to be correlated with the seven emotions. This validates that it is hard for sNTM to introduce the guidance of labels into topic generation, because it models documents and labels separately.
To further evaluate the interpretability of topics extracted from SLTM, we first obtain topic embeddings by $emb(z_k) = \mathbf{W}_1[k, :]$. Then, we map $emb(z_k)$ to a two-dimensional space via Principal Component Analysis (PCA). Figure 3 presents distributions of topics generated by SLTM over the ISEAR dataset. The scatter plot indicates that topics corresponding to the same label are closer than those of different labels. Furthermore, the distance between topics on correlated labels, such as "fear" and "anger", is smaller than that between topics on "joy" and other labels.

Evaluation on Label Prediction
We here evaluate the quality of word embeddings generated by SLTM on predicting categorical and real-valued labels based on ISEAR and YouTube, respectively. Since different models have different parameters, we randomly select 60% of instances as the training set, 20% as the validation set, and the remaining 20% as the testing set. The values of parameters (e.g., the number of topics) for each model are all determined on the validation set. In label prediction, the main difference between SLTM and other supervised topic models is as follows. On the one hand, a label-specific word embedding is introduced for predicting labels in SLTM according to Equation 10. On the other hand, other supervised topic models for both categorical and real-valued label prediction tasks infer labels for unlabeled documents from topic distributions directly, where the topic distributions of unlabeled documents are learned without the supervision of labels.
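The 60%/20%/20% random split described above can be sketched as follows. The fixed seed, the rounding of the split sizes, and the function name `split_dataset` are assumptions made for this illustration.

```python
import random


def split_dataset(items, train=0.6, valid=0.2, seed=0):
    """Shuffle and split into train/validation/test; the rest goes to test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])


# Toy example with 100 document indices.
train_set, valid_set, test_set = split_dataset(range(100))
```

Model parameters such as the number of topics would then be chosen on `valid_set`, and the reported scores computed once on `test_set`.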
For the task of categorical label prediction, the accuracy and Cohen's kappa score (Artstein and Poesio, 2008) are used as the evaluation metrics. Table 5 shows the classification performance of different models on ISEAR, where the best results are highlighted in boldface. For the prediction of real-valued labels on YouTube, we compare different models' regression performance by the mean absolute error (MAE) and the predictive $R^2$ ($pR^2$) (Blei and McAuliffe, 2007), as shown in Table 6. From the above results we can observe that SLTM achieves substantial performance improvement over baselines in predicting both categorical and real-valued labels, which indicates that word embeddings generated from labeled documents are more suitable for label prediction tasks than topic distributions generated from unlabeled documents without the guidance of labels.
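The four evaluation metrics can be computed as in the following sketch. It uses the standard definitions: Cohen's kappa as (p_o - p_e) / (1 - p_e), and predictive R² as one minus the ratio of the squared prediction error to the squared deviation from the mean of the true values. The labels and numeric values are made up, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """(p_o - p_e) / (1 - p_e), with p_e from the marginal label frequencies.

    Undefined (division by zero) when p_e equals 1.
    """
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    labels = set(y_true) | set(y_pred)
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return (p_o - p_e) / (1.0 - p_e)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def predictive_r2(y_true, y_pred):
    """1 - sum of squared errors / total sum of squares around the mean."""
    mean = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - sse / sst


# Toy classification and regression outputs (made up).
emotions_true = ["joy", "fear", "joy", "anger"]
emotions_pred = ["joy", "fear", "anger", "anger"]
acc = accuracy(emotions_true, emotions_pred)
kappa = cohen_kappa(emotions_true, emotions_pred)

strength_true = [0.1, 0.5, 0.9]
strength_pred = [0.2, 0.5, 0.7]
mae = mean_absolute_error(strength_true, strength_pred)
pr2 = predictive_r2(strength_true, strength_pred)
```

Kappa discounts the agreement expected by chance, so it is a stricter summary than accuracy when label frequencies are unbalanced.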

Similarity of Word Embeddings
Word embeddings can reflect relations between words, and most methods of generating word embeddings are based on the local context information. This is because words with similar contexts may have similar semantics. However, a large-scale corpus is required to learn high-quality word embeddings from the local context. Different from the previous word embedding generation methods, SLTM generates word embeddings based on the global label-specific topic information (i.e., the topical embedding space). Therefore, we further compare the quality of word embeddings learned by SLTM and three widely used methods: Word2Vec (W2V) (Mikolov et al., 2013), subword information Word2Vec (siW2V), and SSPMI (Levy and Goldberg, 2014). Among these baseline word embedding models, W2V and siW2V use the neural network framework, and SSPMI implicitly factorizes the pointwise mutual information (PMI) matrix of the local word co-occurrence patterns. As our evaluation metric, the word similarity is estimated as follows. First, we calculate cosine similarity scores for word pairs which occur in both the training set and the testing set. Second, word pairs are ranked according to their cosine similarities in the embedding space and their human-assigned similarity scores, respectively. Finally, the rankings by cosine similarity are evaluated by measuring Spearman's rank correlation with the rankings by human-assigned similarity scores. A higher correlation value indicates greater consistency with human judgements of word similarity. The following standard corpora, which contain word pairs associated with human-assigned similarity scores, are used for this evaluation: MEN (Bruni et al., 2014), SimLex-999 (SimLex) (Hill et al., 2015), and Rare (Luong et al., 2013).
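The three evaluation steps above can be sketched in plain Python. Spearman's rank correlation is computed as the Pearson correlation of average ranks, which handles ties. The embeddings, the word pairs, and the human scores below are made up, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate_similarity(pairs, emb):
    """`pairs` holds (word1, word2, human_score); skip out-of-vocabulary pairs."""
    kept = [(a, b, s) for a, b, s in pairs if a in emb and b in emb]
    model = [cosine(emb[a], emb[b]) for a, b, _ in kept]
    human = [s for _, _, s in kept]
    return spearman(model, human)


# Toy embeddings and human judgements (made up).
emb = {"happy": [1.0, 0.0], "glad": [0.9, 0.1],
       "sad": [0.0, 1.0], "car": [0.5, 0.5]}
pairs = [("happy", "glad", 9.0), ("happy", "car", 4.0),
         ("happy", "sad", 1.0), ("glad", "unknown", 5.0)]
rho = evaluate_similarity(pairs, emb)
```

Only the ordering of the pairs matters for this metric, so embeddings are not penalized for producing cosine values on a different scale from the human scores.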
We train W2V, siW2V, and SSPMI over each corpus by setting the context window size to 5. Furthermore, the dimension of word embeddings generated by all models is set to 50 according to (Lai et al., 2016). The values of word similarity on ISEAR and YouTube are respectively shown in Table 7 and Table 8, where the best results are highlighted in boldface. We can observe that SLTM outperforms baselines in all cases. The results indicate that, when no external corpora are available, word embeddings learned from the global label-specific topic information are better than those learned from the local context information.

Effect of the Hyper-parameter
After validating the effectiveness of SLTM on discovering topics and learning word embeddings, we now investigate the effect of the hyper-parameter in SLTM on these two aspects. According to Equation 8, the hyper-parameter α is used to weight two kinds of loss functions. Since $\mathbf{W}_2$ can only be updated when α > 0, we evaluate the performance of SLTM by varying α from 0.1 to 1 over the ISEAR dataset, as follows. First, we evaluate the influence of the hyper-parameter α on topic discovery by the coherence score of topics. To clearly illustrate the performance trend with different values of α, we set the number of top words T to 5, 10, and 15, and present topic coherence scores in Figure 4. The results indicate that SLTM performs stably under these α values on topic discovery, except for α = 1, which ignores the label information totally. This validates the importance of label information in generating coherent topics.
Second, we use the learned word embeddings to predict document labels under different values of α. As shown in Figure 5, we can observe that when α = 0.5, i.e., $loss(d_i, d_i^{(v_j-)})$ and $loss(\mathbf{y}_i, \mathbf{y}_i^{(v_j-)})$ are weighted equally, SLTM achieves the best performance in label prediction. The results indicate that the co-occurrence of documents and words as well as the label information are both important to generate good word embeddings. Furthermore, the label prediction performance of SLTM using any of these α values is better than that of most baselines (cf. Table 5). This validates the robustness of SLTM with different hyper-parameter values in supervised learning. We also conducted experiments on YouTube using varied α values, and the results indicate that the hyper-parameter has a similar effect on both datasets.

Conclusion
In this paper, we proposed a supervised topic model named SLTM to discover label-specific topics by jointly modeling documents and labels. In SLTM, weight matrices which represent document-topic and topic-word distributions can strictly follow the probabilistic characteristics of topic models. Experiments were conducted on datasets with both categorical and real-valued labels, which validated that SLTM can not only discover more coherent topics, but also boost the performance of supervised learning tasks by learning high-quality word embeddings. For future work, we plan to speed up the training process of SLTM with GPUs and distributed algorithms. With the development of deep learning techniques, we also plan to de-emphasize irrelevant words with an attention mechanism.