Accounting ngrams and multi-word terms can improve topic models

The paper presents an empirical study of integrating ngrams and multi-word terms into topic models while maintaining similarities between them and words based on their component structure. First, we adapt the PLSA-SIM algorithm to the more widespread LDA model and to ngrams. Then we propose a novel algorithm, LDA-ITER, that incorporates the most suitable ngrams into topic models. Experiments on integrating ngrams and multi-word terms, conducted on five text collections in different languages and domains, demonstrate a significant improvement in all the metrics under consideration.


Introduction
Topic models, such as PLSA (Hofmann, 1999) and LDA (Blei et al., 2003), have shown great success in discovering latent topics in text collections. They have numerous applications in information retrieval, text clustering and categorization (Zhou et al., 2009), word sense disambiguation (Boyd-Graber et al., 2007), and other areas.
However, these unsupervised models may not produce topics that conform to the user's existing knowledge (Mimno et al., 2011). One key reason is that the objective functions of topic models do not correlate well with human judgements (Chang et al., 2009). Therefore, it is often necessary to incorporate semantic knowledge into topic models to improve the model's performance. Recent work has shown that interactive human feedback (Hu et al., 2011) and information about words (Boyd-Graber et al., 2007) can improve the inferred topic quality.
Another key limitation of the original algorithms is that they rely on a "bag-of-words" assumption, which means that words are assumed to be uncorrelated and generated independently. While this assumption facilitates computational efficiency, it loses the rich correlations between words. Several studies have investigated the integration of collocations, ngrams and multi-word terms. However, they are often limited to bigrams (Wallach, 2006; Griffiths et al., 2007) and often worsen the model quality, either because the vocabulary grows or because the model becomes more complicated and requires time-intensive computation (Wang et al., 2007).
The paper presents two novel methods that take ngrams into account and maintain relationships between them and the words in topic models (e.g., weapon - nuclear weapon - weapon of mass destruction; discrimination - discrimination on basis of nationality - racial discrimination). The proposed algorithms do not rely on any additional resources, human help or topic-independent rules. Moreover, they lead to a substantial improvement in the quality of topic models on all the measures considered.
All experiments were carried out using the LDA algorithm and its modifications on five corpora in different domains and languages.

Related work
The idea of using ngrams in topic models is not a novel one. Two kinds of methods have been proposed to deal with this problem: the creation of a unified topic model, and the preliminary extraction of collocations for further integration into topic models.
Most studies belong to the first kind and are limited to bigrams, namely the Bigram Topic Model (Wallach, 2006) and the LDA Collocation Model (Griffiths et al., 2007). In addition, Wang et al. (2007) proposed the Topical N-Gram Model, which allows the generation of ngrams based on the context. However, all these models are mostly of theoretical interest, since they are very complex and hard to compute on real datasets.
The second kind of methods includes those proposed in (Lau et al., 2013; Nokel and Loukachevitch, 2015). These works are also limited to bigrams. Nokel and Loukachevitch (2015) extend the first work and propose the PLSA-SIM algorithm, which integrates top-ranked bigrams and maintains the relationships between bigrams sharing the same words. The authors achieve an improvement in topic model quality.
Our first method in the paper extends the PLSA-SIM algorithm (Nokel and Loukachevitch, 2015) by switching to ngrams and the more widespread LDA model. Also we propose a novel iterative LDA-ITER algorithm that allows the automatic choice of the most appropriate ngrams for further integration into topic models.
The idea of utilizing prior knowledge in topic models is not a novel one either, but the current studies are limited to words. For example, Andrzejewski et al. (2011) incorporated knowledge by Must-Link and Cannot-Link primitives represented by a Dirichlet Forest prior. These primitives were then used in (Petterson et al., 2010; Newman et al., 2011), where similar words are encouraged to have similar topic distributions. However, all such methods incorporate knowledge in a hard and topic-independent way, which is a simplification, since two words that are similar in one topic are not necessarily of equal importance for another topic.
Several works also seek to utilize the domain-independent knowledge available in online dictionaries or thesauri, such as WordNet (Xie et al., 2015). We argue that this knowledge may be insufficient for a particular text corpus.
Our current work proposes an approach that maintains the relationships between ngrams sharing the same words. Our method does not require any complication of the original LDA model and only gives advice on whether ngrams and words can be in the same topics or not.

Proposed algorithms
First, we adapt the PLSA-SIM algorithm proposed in (Nokel and Loukachevitch, 2015). We argue that the more widespread model is LDA (Blei et al., 2003). So we transfer the idea of the PLSA-SIM algorithm to LDA and adapt it to multi-word expressions and terms of any length.
The main idea of including multi-word expressions into topic models is that similar ngrams sharing the same words (e.g., hidden - hidden layer - hidden Markov model - number of hidden units) often belong to the same topics, under one important condition: they often co-occur within the same texts.
To implement the approach, we introduce the sets of similar ngrams and words, where $w$ denotes a lemmatized word and $w_1 \ldots w_n$ a lemmatized ngram. While adding ngrams to the vocabulary as single tokens, we decrease the frequencies of their unigram components by the frequencies of the encompassing ngrams in each document $d$.
The resulting frequencies are denoted as $n_{dw}$.
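The frequency adjustment can be illustrated with a minimal Python sketch. The function name, the representation of ngrams as space-separated lemma strings, and the clamping of counts at zero are assumptions made for the example, not details taken from the paper.

```python
from collections import Counter

def adjust_unigram_counts(unigram_counts, ngram_counts):
    """Add ngrams as single tokens and reduce the counts of their components.

    unigram_counts: word -> raw frequency in one document d
    ngram_counts:   ngram (space-separated lemmas) -> frequency in d
    Assumes ngram occurrences are not nested inside each other.
    """
    adjusted = Counter(unigram_counts)
    for ngram, freq in ngram_counts.items():
        adjusted[ngram] += freq          # the ngram becomes a vocabulary item
        for word in ngram.split():
            adjusted[word] -= freq       # each occurrence covers one unigram
    # Unary plus drops zero and negative counts (clamping is an assumption).
    return +adjusted

n_d = adjust_unigram_counts(
    {"hidden": 5, "layer": 3, "unit": 2},
    {"hidden layer": 2},
)
```

Here "hidden" and "layer" each lose the two occurrences that are now counted as "hidden layer".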
The pseudocode of the resulting LDA-SIM algorithm is presented in Algorithm 1.
So, if similar ngrams co-occur within the same document, we sum up their frequencies during calculation of probabilities, trying to carry similar ngrams and words to the same topics. Otherwise we make no modification to the original algorithm.
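The summation step can be sketched as follows. This is an illustration under stated assumptions rather than a reproduction of Algorithm 1: two vocabulary items are treated as similar when they share a component word, and the helper names `build_similar_sets` and `summed_frequencies` are hypothetical. How the summed values enter the topic updates is left to Algorithm 1.

```python
from collections import defaultdict

def build_similar_sets(vocabulary):
    """Map each item to the other items sharing a component word with it."""
    by_word = defaultdict(set)
    for item in vocabulary:
        for word in item.split():
            by_word[word].add(item)
    similar = {}
    for item in vocabulary:
        related = set()
        for word in item.split():
            related |= by_word[word]
        related.discard(item)
        similar[item] = related
    return similar

def summed_frequencies(doc_counts, similar):
    """For each item in a document, add the counts of co-occurring similar items."""
    summed = {}
    for item, n_dw in doc_counts.items():
        # Similar items absent from this document contribute nothing.
        extra = sum(doc_counts.get(s, 0) for s in similar.get(item, ()))
        summed[item] = n_dw + extra
    return summed

vocab = ["hidden", "hidden layer", "hidden markov model", "network"]
similar = build_similar_sets(vocab)
f_d = summed_frequencies({"hidden": 3, "hidden layer": 2, "network": 4}, similar)
```

Items with no similar partner in the document, such as "network" here, keep their original frequency, which corresponds to leaving the original algorithm unmodified.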
Then we hypothesized that it is possible to automatically choose the most suitable ngrams to incorporate into topic models. For this purpose we can compose all possible ngrams from the top elements from each previously inferred topic and further incorporate them into a topic model (e.g., we can compose "support vector machine" from the top words "machine", "vector", "support"). To be precise, we can choose the most frequent ngram that can be composed from the given set of words.
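A minimal sketch of this selection step is given below. It assumes that candidate ngrams are available as tuples of lemmas with corpus frequencies and that every component of a chosen ngram must be among the top words; the function name and the toy frequencies are hypothetical.

```python
def best_ngram_from_top_words(top_words, ngram_freq):
    """Return the most frequent ngram whose words all come from top_words.

    top_words:  top elements of one inferred topic
    ngram_freq: tuple of lemmas -> corpus frequency
    Returns None when no ngram can be composed.
    """
    top = set(top_words)
    candidates = {
        ngram: freq
        for ngram, freq in ngram_freq.items()
        if len(ngram) > 1 and set(ngram) <= top
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

best = best_ngram_from_top_words(
    ["machine", "vector", "support", "kernel"],
    {
        ("support", "vector", "machine"): 40,
        ("support", "vector"): 25,
        ("neural", "network"): 90,   # not composable from the top words
        ("kernel", "machine"): 5,
    },
)
```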
To verify this hypothesis, we propose the novel LDA-ITER algorithm, which builds on the LDA and LDA-SIM algorithms (Algorithm 2). The way ngrams are extracted resembles the approach of Blei and Lafferty (2009), who visualize topics with ngrams consisting of words mentioned in these topics. However, those authors do not create a new topic model that takes the extracted ngrams into account.
Algorithm 2 (LDA-ITER, surviving steps): Create sets of similar ngrams and words. Run LDA-SIM using the set of similar ngrams and words S and the vocabulary.
In the proposed LDA-ITER algorithm we select the top-10 elements from each topic at each iteration. We established experimentally that topic coherence does not depend strongly on this parameter, while the best perplexity is achieved when selecting the top-5 or top-7 elements. Nevertheless, in all experiments we set this parameter to 10.
We should note that the number of parameters in the proposed algorithms equals $|W||T|$, as in the original LDA, where $|W|$ is the size of the vocabulary and $|T|$ is the number of topics (cf. $|W|^N|T|$ parameters in the topical n-gram model (Wang et al., 2007), where $N$ is the length of n-grams).

Datasets and evaluation
In our experiments we used English and Russian text collections in different domains (Table 1). As sources of multi-word terms, we took two real information-retrieval thesauri in the following domains: socio-political (the EuroVoc thesaurus, comprising 15161 terms) and banking (the Russian Banking Thesaurus, comprising 15628 terms). We used the EuroVoc thesaurus when processing the Europarl and JRC-Acquis corpora. The Russian Banking Thesaurus was employed for processing the Russian banking texts.
At the preprocessing step, documents were processed by morphological analyzers. We do not consider function words and low-frequency words as elements of the vocabulary, since they do not play a significant role in forming topics. We also extracted all collocations matching the regular expression ((Adj|Noun)+ | (Adj|Noun)* (Noun Prep)? (Adj|Noun)*)* Noun (similar to the one proposed in (Frantzi and Ananiadou, 1999)). We take into account only such ngrams, since topics are mainly identified by noun groups. We also emphasize that sets of similar ngrams are never formed on the basis of a shared preposition.
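The part-of-speech pattern can be applied to tagged text with an ordinary regular expression once each tag is encoded as one character. The sketch below is only an illustration: the tag names, the `max_len` limit and the function name are assumptions. Because of the outer star, the pattern accepts any sequence of adjectives, nouns and noun-preposition pairs that ends in a noun, so the compiled expression uses that shorter equivalent form.

```python
import re

# One character per tag; any other tag maps to "x" and breaks a match.
TAG_CODES = {"Adj": "A", "Noun": "N", "Prep": "P"}

# Equivalent to ((Adj|Noun)+ | (Adj|Noun)* (Noun Prep)? (Adj|Noun)*)* Noun
CANDIDATE = re.compile(r"(?:NP|[AN])*N")

def extract_candidates(tagged, max_len=5):
    """Return all ngrams (2..max_len tokens) whose tag sequence fits the pattern.

    tagged: list of (lemma, tag) pairs for one sentence.
    """
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    found = []
    for start in range(len(tagged)):
        last = min(start + max_len, len(tagged))
        for end in range(start + 2, last + 1):
            if CANDIDATE.fullmatch(codes[start:end]):
                found.append(" ".join(lemma for lemma, _ in tagged[start:end]))
    return found

sentence = [
    ("weapon", "Noun"), ("of", "Prep"), ("mass", "Noun"),
    ("destruction", "Noun"), ("be", "Verb"),
    ("nuclear", "Adj"), ("weapon", "Noun"),
]
candidates = extract_candidates(sentence)
```

A preposition can only appear directly after a noun, so spans such as "of mass" are rejected.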
As for the quality of the topic models, we consider three intrinsic measures. The first one is Perplexity, which is the standard criterion of topic quality (Daud et al., 2010):
$$\mathrm{Perplexity}(D) = \exp\left(-\frac{1}{n}\sum_{d \in D}\sum_{w \in d} n_{dw} \ln p(w|d)\right),$$
where $n$ is the number of all considered words in the corpus, $D$ is the set of documents in the corpus, $n_{dw}$ is the number of occurrences of the word $w$ in the document $d$, and $p(w|d)$ is the probability of the word $w$ appearing in the document $d$.
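Perplexity is the exponentiated negative average log-likelihood per word, and a minimal sketch of its computation follows. The data layout (a dictionary of per-document counts) and the `word_prob` callback are assumptions for the example; in LDA, $p(w|d)$ would be obtained by summing the product of the topic-word and document-topic probabilities over topics.

```python
import math

def perplexity(doc_counts, word_prob):
    """Compute exp(-(1/n) * sum_d sum_w n_dw * ln p(w|d)).

    doc_counts: document id -> {word: n_dw}
    word_prob:  function (word, document id) -> p(w|d), assumed positive
    """
    log_likelihood = 0.0
    n = 0
    for d, counts in doc_counts.items():
        for w, n_dw in counts.items():
            log_likelihood += n_dw * math.log(word_prob(w, d))
            n += n_dw
    return math.exp(-log_likelihood / n)

# Sanity check: a uniform model over 4 words has perplexity 4.
docs = {"d1": {"a": 2, "b": 1}, "d2": {"c": 3}}
uniform = perplexity(docs, lambda w, d: 0.25)
```

Lower values indicate that the model assigns higher probability to the observed words.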
Another method of evaluating topic models is topic coherence (TC-PMI) proposed by Newman et al. (2010), which measures the interpretability of topics based on human judgment. For a single topic it is computed as
$$\mathrm{TC\text{-}PMI} = \sum_{j=2}^{10}\sum_{i=1}^{j-1} \log \frac{P(w_j, w_i)}{P(w_j)P(w_i)},$$
where $(w_1, w_2, \ldots, w_{10})$ are the top-10 elements in a topic, and $P(w_i)$, $P(w_j)$ and $P(w_j, w_i)$ are the probabilities of $w_i$, $w_j$ and the ngram $(w_j, w_i)$ respectively. Following the idea of Nokel and Loukachevitch (2015), we also used a variation of this measure, TC-PMI-nSIM, which considers the top-10 terms no two of which are from the same set of similar ngrams. To avoid the effect of considering very long ngrams, we took the most frequent item in each found set of similar ngrams.
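A PMI-based coherence score of this kind sums the pointwise mutual information over all pairs of top elements in a topic. The sketch below makes several assumptions that the text does not fix: probabilities are estimated as fractions of documents containing the items, a small constant `eps` guards against pairs that never co-occur, and every top element occurs in at least one document.

```python
import math
from itertools import combinations

def tc_pmi(top_items, documents, eps=1e-12):
    """Sum log(P(w_j, w_i) / (P(w_j) * P(w_i))) over all pairs of top items.

    documents: list of token lists; probabilities are document-level
    fractions (an assumption made for this example).
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*items):
        # Fraction of documents containing all the given items.
        return sum(all(i in d for i in items) for d in doc_sets) / n_docs

    score = 0.0
    for w_i, w_j in combinations(top_items, 2):
        joint = prob(w_i, w_j)
        score += math.log((joint + eps) / (prob(w_i) * prob(w_j)))
    return score

always_together = tc_pmi(["a", "b"], [["a", "b"], ["a", "b"], ["c"], ["c"]])
independent = tc_pmi(["a", "b"], [["a", "b"], ["a"], ["b"], []])
```

Items that always co-occur receive a positive score, while statistically independent items score close to zero.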

Experiments
To compare the proposed algorithms with the original one, we extracted all the ngrams in each text corpus. For ranking ngrams we used Term Frequency (TF) and eight context measures: C-Value (Frantzi and Ananiadou, 1999), two versions of NC-Value (Frantzi and Ananiadou, 1997; Frantzi and Ananiadou, 1999), Token-FLR, Token-LR, Type-FLR, Type-LR (Nakagawa and Mori, 2003), and Modified Gravity Count (Nokel and Loukachevitch, 2013). We should note that context measures are among the best-known methods for extracting ngrams and multi-word terms.
Following the results of (Lau et al., 2013), we decided to integrate the top-1000 ngrams and multi-word terms into all the topic models under consideration. In all experiments we fixed the number of topics $|T| = 100$ and the hyperparameters $\alpha_t = 50/|T|$ and $\beta_w = 0.01$. We conducted experiments with all nine aforementioned measures on all the text collections to compare the quality of the LDA, the LDA with the top-1000 ngrams or multi-word terms added as "black boxes" (similar to (Lau et al., 2013)), and the LDA-SIM with the same top-1000 elements.
In Table 2 we present the results of integrating the top-1000 ngrams and multi-word terms ranked by NC-Value (Frantzi and Ananiadou, 1999) for all five text collections. Other measures under consideration demonstrate similar results.
As we can see, topic coherence improves substantially with the proposed algorithm in all five text collections. This means that the inferred topics become more interpretable. As for perplexity, there is also a significant improvement compared to LDA with ngrams as "black boxes". Moreover, sometimes the perplexity is even better than in the original LDA, although the proposed algorithm works on larger vocabularies, which usually leads to an increase in perplexity.
Table 2: Results of integrating top-1000 ngrams and terms ranked by NC-Value into topic models
We should note that the results of the ACL and NIPS corpora are a little different. This is because the ACL corpus contains a lot of word segments hyphenated at ends of lines, while the NIPS corpus is relatively small.
At the last stage of the experiments, we compare the iterative and original algorithms. In Table 3 we present the results of the first iteration of the LDA-ITER algorithm (with the numbers of the added ngrams and terms) alongside the LDA.
As we can see, there is also an improvement in the topics, despite the fact that the LDA-ITER algorithm selects many more ngrams than in the experiments with the LDA-SIM. As for the multi-word terms, selecting just a few hundred of them results in similar or even better topic quality than selecting regular ngrams. Thus, an important advantage of the LDA-ITER algorithm is that there is no need to choose the number of ngrams to integrate (cf. the LDA-SIM algorithm). We should also note that in subsequent iterations the results start to hover around the same values of the measures.
Table 3: Results of integrating ngrams and multi-word terms into the LDA-ITER algorithm
In Table 4 we present the running time of the LDA-SIM and of the first iteration of the LDA-ITER alongside the original LDA. All the algorithms were run on a notebook with 2. [...]
Finally, as an example of the inferred topics, Table 5 presents the top-10 elements from two random topics inferred by the LDA-SIM with the 1000 most frequent ngrams and by the first iteration of the LDA-ITER on the ACL corpus.

Conclusion
The paper presents experiments on integrating ngrams and multi-word terms, along with similarities between them and words, into topic models. First, we adapted the existing PLSA-SIM algorithm to the LDA model and to ngrams. Then we proposed the LDA-ITER algorithm, which allows us to incorporate the most suitable ngrams and multi-word terms. The experiments conducted on five text collections in different domains and languages demonstrate a substantial improvement in all the quality metrics when using the proposed algorithms.