PhraseCTM: Correlated Topic Modeling on Phrases within Markov Random Fields

Recently emerged phrase-level topic models are able to provide topics of phrases, which are easy for humans to read. But these models lack the ability to capture the correlation structure among the numerous discovered topics. We propose a novel topic model, PhraseCTM, and a two-stage method to find correlated topics at the phrase level. In the first stage, we train PhraseCTM, which models the generation of words and phrases simultaneously by linking the phrases and their component words within Markov Random Fields when they are semantically coherent. In the second stage, we generate the correlation of topics from PhraseCTM. We evaluate our method with a quantitative experiment and a human study, showing that correlated topic modeling on phrases is a good and practical way to interpret the underlying themes of a corpus.


Introduction
In recent years, topic modeling on phrases has been developed to provide more interpretable topics (El-Kishky et al., 2014; Kawamae, 2014; He, 2016). These models represent each topic as a list of phrases, which are easy for humans to read. For example, the topic represented by "grounding conductor, grounding wire, aluminum wiring, neutral ground, ..." is easier to read than the topic with the words "ground, wire, use, power, cable, wires, ...", although both are about household electricity.
But when the number of topics grows, it is hard to review all the topics, even when they are represented as phrases. The correlation structure was introduced by CTM (Blei and Lafferty, 2005) to identify the relationships between topics and to group similar topics together. The correlated topics mined from scientific papers (Blei and Lafferty, 2007), news corpora (He et al., 2017), and social science data (Roberts et al., 2016) have shown their practical utility for grasping the semantic meaning of text documents.
However, it is nontrivial to apply CTM directly to phrases, mainly for two reasons: (1) each document contains far fewer phrases than words; (2) similar to LDA (Tang et al., 2014), CTM does not perform well on short documents. Therefore, CTM needs more contextual information than the extracted phrases alone to build a good enough model.
To find correlated topics at the phrase level, we take full advantage of the contextual information about phrases. First, the topic of a phrase in a document is highly related to the topics of other words and phrases in the same document. Second, the meaning of some phrases can be inferred from their component words. Taking the document in Figure 1 as an example, the phrase "orbital vehicle" shares the same topic as the word "DC-X" (a reusable spaceship), as well as its component words "orbital" and "vehicle", which are all about the topic of space exploration. The assumption that the words within the same phrase tend to have the same latent topic is directly used by PhraseLDA (El-Kishky et al., 2014). Note, however, that not all phrases have the same topic as their component words (e.g., the newspaper Boston Globe) (Mikolov et al., 2013). It is difficult to distinguish the "orbital vehicle" type of phrases from the "Boston Globe" type, but we can use a data-driven method to find the semantically coherent ones with the NPMI metric (Bouma, 2009), and put them in Markov Random Fields (Kindermann and Snell, 1980) to align the topics of phrases and their component words.
Figure 1 (example document): "It will be tough to make DC-X succeed, and to turn it into an operational orbital vehicle. Doubtless it will fail to meet some of the promised goals. The reason people are so fond of it is that it's the only chance we have now, or will have for a long time to come, to develop a launch vehicle with radically lower costs. …" Highlighted phrases: "orbital vehicle", "launch vehicle".
Figure 1: An example of phrases' contextual information. The phrases and words marked in gray are about the same topic. The arrows show that the topics of the phrases and their component words tend to be the same, a tendency that is modeled within a Markov Random Field.
Based on these two kinds of contextual information, we propose a novel topic model, PhraseCTM, and a two-stage method. In the first stage, we train PhraseCTM, which (1) counts each phrase twice, once as the phrase itself and once as its component words; (2) models the generation of words and phrases simultaneously by linking the phrases and their component words within Markov Random Fields when they are semantically coherent; (3) uses the logistic normal distribution to represent the correlation among the topics, as in CTM. In the second stage, we generate the correlation of topics from PhraseCTM.
We evaluate our method on five datasets by a quantitative experiment and a human study, showing that the correlated topic modeling on phrases is a good way to interpret the underlying themes of a corpus.

Related Works
There are two orthogonal lines of research related to our work. (1) With the development of phrase extraction techniques (El-Kishky et al., 2014; Liu et al., 2015; Shang et al., 2018), several topic models based on extracted phrases have been proposed to provide high-quality phrase-level topics, such as PhraseLDA (El-Kishky et al., 2014) and TPM (He, 2016). Because of the quality of the extracted phrases, PhraseLDA performs better than the previous n-gram method TNG (Wang et al., 2007), which combines phrase extraction and topic modeling. (2) CTM (Blei and Lafferty, 2005) uses the logistic normal distribution (Aitchison, 1982) to replace the Dirichlet prior, so it can capture the correlation structure of topics. Experiments in subsequent work (Roberts et al., 2016; He et al., 2017) showed its usefulness in exploring text corpora by using correlated word-level topics. Note that our work is not a simple combination of these two methods, because the existing topic models on phrases lack the ability to capture the correlation structure, while CTM cannot be directly applied to phrases due to the sparseness of phrases in each document. Although we also use Markov Random Fields (Kindermann and Snell, 1980), our work differs from previous ones (Daume III, 2009; Sun et al., 2009; Xie et al., 2015) because we do not put all links into the Markov Random Fields but only choose the semantically coherent links.

The proposed method
After preparing the data as described in Subsection 3.1, our method is carried out in two stages, described in Subsections 3.2 and 3.3 respectively.

Semantically Coherent Links for MRF
Given the input data in the form of raw text documents, we transform each document into the format "words, phrases, semantically coherent links between phrases and component words". Extracting words is trivial, and extracting phrases can be done with an existing tool, e.g., AutoPhrase (Shang et al., 2018). In this process, each extracted phrase is counted twice: once as the phrase itself (represented in the phrase vocabulary), and once as its component words (represented in the word vocabulary).
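The double counting described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `prepare_document`, the greedy longest-match strategy, and the `max_len` parameter are assumptions made for the example, and the phrase vocabulary is assumed to come from a phrase extractor such as AutoPhrase.

```python
from typing import List, Set, Tuple

def prepare_document(
    tokens: List[str], phrase_vocab: Set[str], max_len: int = 4
) -> Tuple[List[str], List[Tuple[str, List[int]]]]:
    """Return the word view and the phrase view of one tokenized document.

    Hypothetical helper: every token stays in the word view (so component
    words are counted as words), and each matched phrase is additionally
    recorded together with the positions of its component words.
    """
    words = list(tokens)
    phrases: List[Tuple[str, List[int]]] = []
    i = 0
    while i < len(tokens):
        match = None
        # Greedy longest match against the phrase vocabulary (assumption).
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in phrase_vocab:
                match = (candidate, list(range(i, i + n)))
                break
        if match is not None:
            phrases.append(match)
            i += len(match[1])
        else:
            i += 1
    return words, phrases

tokens = "turn it into an operational orbital vehicle".split()
words, phrases = prepare_document(tokens, {"orbital vehicle", "launch vehicle"})
```

Here "orbital vehicle" appears once in the phrase view, while "orbital" and "vehicle" remain in the word view, so the phrase contributes to both vocabularies.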
In a given document, we denote the i-th phrase as $w^{(P)}_i$ and its component words as $w_{l(i)}$. We use Equation (1) to determine the semantic coherence score between $w^{(P)}_i$ and $w_{l(i)}$ by utilizing NPMI (Bouma, 2009). The NPMI metric is defined upon two word types as $\mathrm{NPMI}(w_i, w_j) = \log\frac{p(w_i, w_j)}{p(w_i)\,p(w_j)} \big/ \left(-\log p(w_i, w_j)\right)$. Bouma (2009) pointed out that NPMI has the advantage of ranging within a fixed interval. Inherited from NPMI, the semantic coherence score also ranges from -1 to 1. A negative semantic coherence score means that the phrase does not share the same topic with its component words at the corpus level (e.g., long run, the newspaper Boston Globe). A positive score means the opposite, and a score of 1 suggests that the phrase and its component words should be aligned to the same topic across the whole corpus. With a reasonable threshold $\tau$, we add a semantically coherent link between $w^{(P)}_i$ and $w_{l(i)}$ when their score exceeds $\tau$. In practice, we set $\tau$ to 0.4. Assuming the topic of the phrase $w^{(P)}_i$ is $z^{(P)}_i$ and the topics of the words $w_{l(i)}$ are $z_{l(i)}$, when they have the above-mentioned semantically coherent link, we put $z^{(P)}_i$ and $z_{l(i)}$ in a Markov Random Field. More specifically, for $z^{(P)}_i$ and $z_j$, $j \in l(i)$, there is an edge potential function $\exp\{\kappa \cdot \mathbb{1}(z^{(P)}_i = z_j)\}$, where $\kappa$ is the weight that adjusts how strongly the link constrains the topics to be the same. In the following experiments, $\kappa$ is set to $10^{-3}$.
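As a rough illustration of how NPMI-based link selection might work, the sketch below computes the standard NPMI of Bouma (2009) from document-level co-occurrence and applies the threshold τ = 0.4. The exact form of the semantic coherence score in Equation (1) is not reproduced here, so averaging NPMI over pairs of component words is an assumption, as are the function names and the toy corpus.

```python
import math
from itertools import combinations

def npmi(p_i: float, p_j: float, p_ij: float) -> float:
    """Standard NPMI of two word types given their (co-)occurrence probabilities."""
    if p_ij <= 0.0:
        return -1.0  # the words never co-occur: lower bound of NPMI
    if p_ij >= 1.0:
        return 0.0   # degenerate case; returning 0 is a convention chosen here
    return math.log(p_ij / (p_i * p_j)) / (-math.log(p_ij))

def pair_npmi(docs, w_i, w_j) -> float:
    """NPMI estimated from document-level co-occurrence (assumed estimator)."""
    n = len(docs)
    p_i = sum(w_i in d for d in docs) / n
    p_j = sum(w_j in d for d in docs) / n
    p_ij = sum(w_i in d and w_j in d for d in docs) / n
    return npmi(p_i, p_j, p_ij)

def coherence_score(docs, component_words) -> float:
    """Assumed aggregation: mean NPMI over pairs of component words (needs >= 2 words)."""
    pairs = list(combinations(component_words, 2))
    return sum(pair_npmi(docs, a, b) for a, b in pairs) / len(pairs)

def has_link(docs, component_words, tau: float = 0.4) -> bool:
    """Add a semantically coherent link only if the score exceeds tau."""
    return coherence_score(docs, component_words) > tau

# Toy corpus: each document is a set of word types.
docs = [
    {"orbital", "vehicle", "launch"},
    {"orbital", "vehicle"},
    {"boston", "city"},
    {"globe", "map"},
    {"boston", "harbor"},
    {"globe", "earth"},
]
orbital_score = coherence_score(docs, ["orbital", "vehicle"])
boston_score = coherence_score(docs, ["boston", "globe"])
```

In this toy corpus, "orbital" and "vehicle" always co-occur and receive a link, whereas "boston" and "globe" never co-occur and do not.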

PhraseCTM
In the first stage, given the data prepared as described in Subsection 3.1, we train PhraseCTM to obtain better correlated phrase-level topics $\beta^{(P)}$. The contextual information of phrases includes (i) words in the same document, and (ii) their component words within semantically coherent links. Part (ii) has been modeled in the previous subsection. For part (i), we let the phrases and words in the same document $d$ share the topic parameter $\eta_d$, which is a $K$-dimensional vector sampled from a Gaussian distribution $\mathcal{N}(\mu, \Sigma)$. As in CTM, $\Sigma$ is the covariance matrix, which models the correlation between topics.
As a part of the MRF, the unary potential on the topic node $z^{(P)}_{d,i}$ or $z_{d,j}$ is defined by a logistic-normal distribution as in CTM, $p(z_{d,j} = k \mid \eta_d) = \exp(\eta_{d,k}) / \sum_{k'} \exp(\eta_{d,k'})$. Therefore, the joint distribution of topics over the phrases and the words in document $d$ is defined as the following equation, where $A_d(\eta_d)$ is used for normalization, and $N_{L_d}$ is the number of semantically coherent links in document $d$.
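To make the roles of the unary and edge potentials concrete, the following sketch samples a document-level topic parameter from a Gaussian, applies the softmax to obtain the unary topic probabilities, and then shows how a single edge potential of the form exp{κ·1(z_phrase = z_word)} reweights the joint topic distribution of one linked phrase-word pair. It is a simplified illustration under stated assumptions: it considers one link in isolation and ignores any scaling by the number of links, and the names `softmax` and `pair_joint`, the values of µ and Σ, and the random seed are all chosen for the example.

```python
import numpy as np

def softmax(eta: np.ndarray) -> np.ndarray:
    """Unary topic probabilities p(z = k | eta) as in the logistic-normal model."""
    e = np.exp(eta - np.max(eta))  # subtract the max for numerical stability
    return e / e.sum()

def pair_joint(eta: np.ndarray, kappa: float) -> np.ndarray:
    """Joint distribution over (z_phrase, z_word) for one linked pair.

    Product of the two unary potentials and the edge potential
    exp(kappa * 1[z_phrase == z_word]), normalized over all topic pairs.
    """
    theta = softmax(eta)
    edge = np.exp(kappa * np.eye(len(eta)))  # exp(kappa) on the diagonal, 1 elsewhere
    unnorm = np.outer(theta, theta) * edge
    return unnorm / unnorm.sum()

rng = np.random.default_rng(0)
mu = np.zeros(3)                       # illustrative mean, K = 3 topics
sigma = np.array([[1.0, 0.8, 0.0],     # illustrative covariance: topics 0 and 1 correlated
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
eta_d = rng.multivariate_normal(mu, sigma)
theta_d = softmax(eta_d)

joint_free = pair_joint(eta_d, kappa=0.0)      # no link: independent topics
joint_linked = pair_joint(eta_d, kappa=1e-3)   # linked pair with the paper's kappa
agree_free = np.trace(joint_free)              # probability that both topics agree
agree_linked = np.trace(joint_linked)
```

With κ = 0 the joint reduces to the product of the two softmax distributions; a positive κ shifts probability mass toward assignments in which the phrase and its component word share a topic.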
The whole generation process is illustrated in Figure 2(a). We train PhraseCTM by variational inference, as in CTM. The difference lies in the phrases and component words that are in semantically coherent links. For these phrases, we use Eq. (2) to update the variational parameters $\phi^{(P)}_{d,i}$ for the latent topic $z^{(P)}_{d,i}$ of the phrase $w^{(P)}_{d,i}$. For the phrases that are not in semantically coherent links, we use Eq. (3), which is the same as the original CTM's variational inference. In Eqs. (2) and (3), $\lambda_d$ is the variational parameter for $\eta_d$. Similarly, the variational parameters for the component words in semantically coherent links are updated by Eq. (4), while other words are updated by Eq. (5). In this way, PhraseCTM introduces the mutual influence between words and phrases through Markov Random Fields.
Figure 2(a): The first stage: training our proposed model PhraseCTM. Given the observed words $W$ and phrases $W^{(P)}$, we learn the word topics $\beta$ and the phrase topics $\beta^{(P)}$.
Figure 2(b): The second stage: inferring the phrase topics' correlation. Given the phrases $W^{(P)}$ and the phrase topics $\beta^{(P)}$ learned in the first stage, we infer the phrase-level topics' covariance $\Sigma^{(P)}$ as the correlation result.

Generation of Phrase Topics' Correlation
In the second stage, we aim to get the correlation for $\beta^{(P)}$. It cannot be directly derived from $\Sigma$, which has also been learned in the first stage, because $\Sigma$ includes the impact from word topics. Thus, given $W^{(P)}$ and $\beta^{(P)}$, we use variational inference again to learn $\Sigma^{(P)}$, as illustrated in Figure 2(b). Finally, the correlation matrix can be computed by $\mathrm{corr}^{(P)}_{i,j} = \Sigma^{(P)}_{i,j} \big/ \sqrt{\Sigma^{(P)}_{i,i}\,\Sigma^{(P)}_{j,j}}$.

Table 1: The statistics of the datasets. On average, phrases are sparser than words. Phrases are extracted by AutoPhrase (Shang et al., 2018).
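The final step of the second stage, turning a learned covariance matrix into a correlation matrix, is the standard covariance-to-correlation normalization. A minimal sketch is given below; the function name and the 2×2 example matrix are illustrative and do not come from the experiments.

```python
import numpy as np

def covariance_to_correlation(sigma: np.ndarray) -> np.ndarray:
    """corr[i, j] = sigma[i, j] / sqrt(sigma[i, i] * sigma[j, j])."""
    std = np.sqrt(np.diag(sigma))
    return sigma / np.outer(std, std)

# Illustrative covariance between two phrase-level topics.
sigma_p = np.array([[4.0, 2.0],
                    [2.0, 9.0]])
corr_p = covariance_to_correlation(sigma_p)
```

The off-diagonal entries of the result can then be thresholded or ranked to decide which pairs of phrase topics to connect when drawing a topic graph.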

Experiments
PhraseCTM is expected to offer two benefits: (1) generating high-quality phrase-level topics; (2) providing the correlation among phrase topics to help users understand the underlying themes of a corpus. To check the first claim, we compare PhraseCTM with existing topic models on phrases. To evaluate the second claim, we design a user study to compare PhraseCTM with standard CTM that runs only on words.
Datasets. We choose several public text corpora, including 20Newsgroups (Lang, 1995), subsets of English Wikipedia, and a subset of PubMed Abstracts (Varmus et al., 1999). For efficiency reasons, we do not test on the whole Wikipedia corpus. We construct the Mathematics, Chemistry, and Argentina subsets of English Wikipedia as . For each corpus, we extract the phrases with the implementation of AutoPhrase (Shang et al., 2018), and build the semantically coherent links as described in Subsection 3.1. More specifically, in the phrase extraction process, we set the minimum support to 5 and leave the other AutoPhrase parameters at their suggested values. On average, each document of the resulting 20Newsgroups dataset contains only 2.7 phrases but 72.3 words, showing that phrases are far fewer than words. More statistics about the datasets are shown in Table 1.

Quantitative Result
Baselines. We compare with PhraseLDA (El-Kishky et al., 2014), the state-of-the-art model on phrases. In addition, we run plain LDA and plain CTM directly on the extracted phrases (without considering the impact of words). To check the effectiveness of the MRF, we run a variant, PhraseCTM(-), obtained by removing all the semantically coherent links from PhraseCTM. We also run an n-gram-based topic model, TNG (Wang et al., 2007), which has already been implemented in Mallet. The number of topics is set to 100 for all models. For plain LDA and PhraseLDA, we set β = 0.005. For plain CTM and TNG, we use the default settings of their existing implementations. Since TNG combines phrase extraction and topic modeling, we run it on the raw datasets.
We use the NPMI metric (Bouma, 2009) to evaluate the semantic coherence of the top-10 phrases in each topic (K=100), taking the entire English Wikipedia as the reference corpus. Roder (2015) has shown that the NPMI metric is highly correlated with human topic coherence ratings, so it is natural to use it to show how PhraseCTM improves the quality of topics. Although we have already used NPMI in the semantic coherence score for finding semantically coherent links, this does not undermine the validity of the metric for topic evaluation, because the semantic coherence score is defined upon two word types while the NPMI score on topics is defined upon two phrase types. The result is shown in Figure 3. Due to its computational cost, TNG cannot scale up to large datasets. Plain LDA and plain CTM do not perform well on the datasets because of the sparsity of phrases in each document, while TNG performs better than them as it can utilize more words as contextual information. PhraseCTM(-) is comparable to PhraseLDA in the experiment. PhraseLDA also utilizes all the contextual information, under the assumption that the words in a phrase have the same topic. But this assumption is too strong, and it can be relaxed by our semantically coherent links in the MRF. This experiment demonstrates that our method generates high-quality phrase-level topics.

Human Study
To compare the correlated topics at different levels, we run CTM directly on words. After training CTM and PhraseCTM on Argentina@Wiki and Maths@Wiki, we presented the top-10 words/phrases of each topic and the correlation of topics to 10 human annotators, and asked them to label the topics. The time spent on topic labeling is a useful metric for checking whether the topics are easy for human annotators to understand. It is based on the following observation from the human study: a confusing topic takes more time to label appropriately, while a good one is easy to understand and takes less time.
There are 2 groups of annotators. The annotators in Group A got the CTM result on Maths@Wiki and the PhraseCTM result on Argentina@Wiki. The annotators in Group B got the results in the opposite setting. The labeling process was logged to calculate the accumulated time. The time needed to reach 50 accurate topic labels with PhraseCTM is much less than with CTM. On average, the annotators spent 7.1 minutes on PhraseCTM and 13.2 minutes on CTM, as listed in Table 2. In Figure 3, it is easy to label the topics in the top right corner as politics, and the edges in the figure illustrate the correlation between topics. As an example, the edge between topic 71 and topic 31 indicates that economics and politics in Argentina are related, helping users to understand the corpus.

Conclusion
We provide a new topic model, PhraseCTM, to make Correlated Topic Modeling available for phrase-level topics. PhraseCTM utilizes more contextual information about phrases and puts it within Markov Random Fields, so it can provide high-quality correlated topics at the phrase level. The experiments show that correlated topic modeling on phrases is a practical tool for interpreting the underlying themes of a corpus. In the future, we will optimize the efficiency of PhraseCTM to scale it up to large datasets.