Online Multilingual Topic Models with Multi-Level Hyperpriors

For topic models, such as LDA, that use a bag-of-words assumption, it becomes especially important to break the corpus into appropriately-sized “documents”. Since the models are estimated solely from term cooccurrences, extensive documents such as books or long journal articles lead to diffuse statistics, and short documents such as forum posts or product reviews can lead to sparsity. This paper describes practical inference procedures for hierarchical models that smooth topic estimates for smaller sections with hyperpriors over larger documents. Importantly for large collections, these online variational Bayes inference methods perform a single pass over a corpus and achieve better perplexity than “flat” topic models on monolingual and multilingual data. Furthermore, on the task of detecting document translation pairs in large multilingual collections, polylingual topic models (PLTM) with multi-level hyperpriors (mlhPLTM) achieve significantly better performance than existing online PLTM models while retaining computational efficiency.


Introduction
Bag-of-words models simplify the representation of documents by discarding grammatical information and relying solely on document-level word co-occurrence statistics. Topic models, such as latent Dirichlet allocation (LDA) (Blei et al., 2003), use this representation. A major drawback of the bag-of-words representation, especially in collections of large documents, is that the word co-occurrence statistics are computed at the document level and as such they do not distinguish words that co-occur near each other from words that co-occur far apart.
One alternative approach to longer documents that has received attention in the past has been to directly model local (i.e., Markov) dependencies among tokens. For example, the topical n-gram model (TNG) introduced by Wang et al. (2007) models unigram and n-gram phrases as mixtures of topics based on the nearby word context. More recently, Jameel and Lam (2013) proposed an LDA extension that uses word sequence information to generate topic distributions over n-grams and performs topic segmentation using segment and paragraph information. While these and many other approaches offer better and more realistic modeling of word sequences, they do not model topical variations across document sections in either mono- or multilingual collections.
In this paper, we focus on hierarchical models for improving topic models of long documents. In the past, document-topic hierarchical prior structures have been explored for LDA. For example, prior work showed that a Gibbs sampling implementation of asymmetric Dirichlet priors provides better modeling of documents across the whole collection compared to the original LDA approach. More recently, Kim et al. (2013) introduced tiLDA, a topic model of monolingual document collections with nested hierarchies. In order to achieve reasonable performance on large document collections with deep hierarchies, tiLDA utilizes parallel variational Bayes (VB) inference. While VB is known to converge faster than Gibbs sampling, and parallel implementations are faster still, both approaches require multiple iterations over the whole collection, in addition to the overhead of parallelizing the model parameters. Furthermore, these approaches focus on monolingual collections.
We propose an online VB inference approach for topic models that captures document-specific effects of local and long-range word co-occurrence by modeling individual document sections using a multi-level Dirichlet prior structure. The proposed models assign Dirichlet priors to individual document sections that are coupled by a document-level hierarchical Dirichlet prior, which facilitates explicit modeling of the variation in topics across documents in mono- and multilingual collections. This in turn streamlines the use of topic models in collections of large documents with a predetermined section structure. Our contribution is twofold: (1) we present an online VB inference approach for topic models with a multi-level Dirichlet prior structure and, more importantly, (2) we introduce a polylingual topic model (PLTM) with multi-level hyperpriors (mlhPLTM) that is capable of efficiently modeling topical variations across document sections in large multilingual collections.

Efficient Multi-level Hyperpriors
The original LDA model and its multilingual variant, PLTM, use symmetric Dirichlet priors over the document-topic distributions θ_d and topic-word distributions ϕ_k, which means that the concentration parameter α of the Dirichlet distribution is fixed and that the base measure u is uniform across all topics. Symmetric Dirichlet priors assume that all documents in the collection are drawn from the same family of distributions. This assumption is not suitable for collections of documents that cover a diverse set of topics. In the past this issue has been addressed with asymmetric priors, where the base measures are non-uniform. One way to assign asymmetric priors to individual documents is to treat the base measure vector u as a hidden variable and assign a symmetric Dirichlet prior to it, which creates a hierarchical Dirichlet prior structure over all document-topic distributions in the collection.

Figure 1: mlhLDA: Graphical representation (left); free variational parameters for the online VB approximation (right).

In addition to the document-topic distribution θ_d, we introduce section-topic distributions θ_s. The existing symmetric Dirichlet prior over θ_d creates a hierarchical Dirichlet prior over the section-topic distributions θ = (θ_d, θ_{s_1}, θ_{s_2}, ..., θ_{s_S}), with each θ_s ∼ Dirichlet(α_s θ_d). In this setting, the most widely used approach for estimating θ_d is Minka's (2000) fixed-point iteration, which is also used by Kim et al. (2013). Instead, we use a more efficient approach for estimating the Dirichlet-multinomial hyperparameters that approximates the digamma differences in Minka's updates, which Wallach (2008) showed to be more efficient. Figure 1 shows the graphical model representation (left) of our model, which we refer to as multi-level hyperprior LDA (mlhLDA), along with the free variational parameters for approximating the posteriors (right).
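To make the hyperparameter estimation step concrete: for integer counts, the digamma differences in Minka's fixed-point update can be computed exactly via the recurrence Ψ(x + n) − Ψ(x) = Σ_{i=0}^{n−1} 1/(x + i), which avoids special-function evaluations entirely. The following sketch is our illustration, not the authors' implementation, and all function names are ours:

```python
def digamma_diff(x, n):
    """Psi(x + n) - Psi(x) for a non-negative integer n, computed exactly
    via the recurrence Psi(y + 1) = Psi(y) + 1/y (no special functions)."""
    return sum(1.0 / (x + i) for i in range(n))

def estimate_alpha(counts, alpha, iterations=200):
    """Minka-style fixed-point estimate of an asymmetric Dirichlet prior
    for the Dirichlet-multinomial.

    counts : list of per-document topic-count vectors (integers)
    alpha  : initial K-dimensional hyperparameter vector
    """
    K = len(alpha)
    doc_lengths = [sum(nd) for nd in counts]
    for _ in range(iterations):
        alpha0 = sum(alpha)
        denom = sum(digamma_diff(alpha0, n) for n in doc_lengths)
        alpha = [alpha[k] * sum(digamma_diff(alpha[k], nd[k]) for nd in counts) / denom
                 for k in range(K)]
    return alpha
```

With counts skewed toward one topic, the fixed point pushes the corresponding α_k up, yielding exactly the kind of asymmetric prior described above.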

Inference using Online VB
Due to its ease of implementation, the most widely used approach for inferring LDA posterior distributions is Gibbs sampling (Griffiths and Steyvers, 2004), which was also originally used for PLTM. The VB approach (Blei et al., 2003), on the other hand, offers more efficient computation but, as with Gibbs sampling, requires iterating over the whole collection multiple times (e.g., Kim et al. (2013)). More recently, Hoffman et al. (2010) introduced online LDA (oLDA), which relies on online stochastic optimization and requires only a single pass over the whole collection. The same approach was later extended to PLTM (oPLTM) (Krstovski and Smith, 2013). In our work we also utilize online VB to implement the multi-level hyperprior (mlh) structure in LDA and PLTM. As in batch VB, in online VB locally optimal values of the free variational parameters γ and φ, which approximate the posteriors over θ and z, are computed in the E step of the algorithm, but on a batch b of documents d_i (rather than the whole collection D as in batch VB) while holding the topic-word variational parameter λ fixed. In the M step, λ is updated using a stochastic gradient step: we first compute the batch-optimal values λ̃ from the batch-optimal values of φ, and then combine λ̃ with the value of λ computed on the previous batch through a weighted average with step size ρ_t:

λ ← (1 − ρ_t) λ + ρ_t λ̃. (2)

When computing the section-topic variational parameters we follow the lower bound derived by Kim et al. (2013). This bound, which is looser than the original VB evidence lower bound (ELBO), allows the batch VB approach to be used with asymmetric priors. More specifically, given the document-topic variational parameter γ_dk, in the E step of our online VB approach the update for the section-topic variational parameter γ_sk becomes:

γ_sk = α_s (γ_dk / Σ_k' γ_dk') + Σ_{n∈s} φ_snk.
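The M-step blend in equation (2) is the standard online stochastic update of Hoffman et al.'s online LDA, with a step size ρ_t = (τ_0 + t)^{−κ} that decays over batches. A minimal sketch, with variable names of our own choosing:

```python
def step_size(t, tau0=1.0, kappa=0.7):
    """rho_t = (tau0 + t)^(-kappa); kappa in (0.5, 1] guarantees convergence."""
    return (tau0 + t) ** (-kappa)

def update_lambda(lam, lam_batch, t, tau0=1.0, kappa=0.7):
    """Weighted average of the previous global lambda and the batch-optimal
    lambda-tilde, as in equation (2). Both arguments are K x V matrices
    stored as lists of rows."""
    rho = step_size(t, tau0, kappa)
    return [[(1 - rho) * old + rho * new for old, new in zip(row_old, row_new)]
            for row_old, row_new in zip(lam, lam_batch)]
```

Since ρ_0 = 1 (with τ_0 = 1), the first batch replaces λ outright, and later batches contribute progressively smaller corrections.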

Online PLTM with multi-level Dirichlet Priors
Given an aligned multilingual document tuple, PLTM assumes that (1) there exists a single tuple-specific distribution across topics and (2) there is a set of language-specific topic-word distributions. Each word is generated from a language- and topic-specific multinomial distribution ϕ^l_k as selected by the topic assignment variable z^l_n, i.e., w^l_n ∼ Multinomial(ϕ^l_{z^l_n}). We extend this model by introducing section-specific topic distributions θ_s across the different languages in the tuple, which are coupled by the tuple-specific document-topic distribution θ_d.
Given a collection of document tuples d, where each tuple contains l documents that are translations of each other in different languages, mlhPLTM assumes the following generative process. For each language l in the collection, the model first generates a set of k ∈ {1, 2, ..., K} topic-word distributions ϕ^l_k, drawn from a Dirichlet prior with language-specific hyperparameter β^l: ϕ^l_k ∼ Dirichlet(β^l). For each document d^l with s_d sections in tuple d, mlhPLTM then assumes the following generative process:
• For each section s_d in document tuple d: draw a section-topic distribution θ_s from the hierarchical Dirichlet prior centered on the tuple-level distribution θ_d; then, for each word position in each language l, draw a topic assignment z^l_n from θ_s and a word w^l_n from ϕ^l_{z^l_n}.

Figure 2: mlhPLTM: Graphical model representation.
Figure 3: mlhPLTM: Graphical representation of the free variational parameters for the online VB approximation.
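The generative process above can be sketched as a simulation. This is an illustration under our own naming, assuming the section-level prior is the tuple-level distribution θ_d rescaled by a section concentration α_s:

```python
import random

def dirichlet(alpha):
    """Sample from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_tuple(phi, alpha_d, alpha_s, n_sections, words_per_section):
    """phi[l][k][v]: word distribution of topic k in language l (given).
    Returns one document per language, generated section by section."""
    K = len(alpha_d)
    theta_d = dirichlet(alpha_d)  # tuple-level topic distribution
    docs = {lang: [] for lang in phi}
    for _ in range(n_sections):
        # hierarchical prior: section distribution centered on theta_d
        theta_s = dirichlet([alpha_s * theta_d[k] for k in range(K)])
        for lang, topics in phi.items():
            for _ in range(words_per_section):
                z = random.choices(range(K), weights=theta_s)[0]
                w = random.choices(range(len(topics[z])), weights=topics[z])[0]
                docs[lang].append(w)
    return docs
```

Because every section's θ_s shares the single θ_d as its base measure, topics of the aligned languages stay coupled within a tuple while still varying section by section.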

Modeling Sections in Scientific Articles
We explore the ability of mlhLDA to model variations across document sections found in scientific articles using a collection of journal articles from the Astrophysics Data System (ADS) (Kurtz et al., 2000). Our collection consists of 130k training articles (888,346 sections) and a held-out set of 8,078 articles (54,502 sections). Figure 4 shows an example mlhLDA representation of an ApJ article with 100 topics. Shown on the top is the inferred topic representation of the whole document (θ_d) which, in the mlhLDA model, serves as a prior for the section-topic distributions (θ_s). Shown on the bottom are example sections of the article along with their inferred section-topic representations:

INTRODUCTION
Blazars are an intriguing class of active galactic nuclei (AGNs), dominated by non-thermal radiation over the entire electromagnetic spectrum. Their emission extends from radio to TeV energies with a broadband spectral energy distribution (SED) typically described by two main components, the first peaking from IR to X-ray energy range in which blazars are the most commonly detected extragalactic sources ...

SUMMARY AND DISCUSSION
We have presented the infrared characterization of a sample of blazars detected in the γ-ray. In order to perform our selection, we considered all the blazars in the ROMA-BZCAT catalog (Massaro et al. 2010) that are associated with a γ-ray source in the 2FGL (The Fermi-LAT Collaboration 2011). Then, we searched for infrared counterparts in the WISE archive adopting the same criteria described in Massaro et al. ...

The left side of Figure 5 shows the held-out perplexity comparison between oLDA and mlhLDA across 13 different topic configurations. For this set of experiments we used the above training set of 130k articles and the set of 8,078 held-out articles. From these comparisons we clearly see the advantage of using the multi-level Dirichlet prior structure. Topic models can also be evaluated through an extrinsic task, but no such task was available for this collection. In the case of oLDA, article sections were treated as individual documents. In the original oLDA implementation (http://www.cs.princeton.edu/~mdhoffma), the per-document concentration parameter α_d was set to 1/K, which we also use for both the symmetric prior over θ_d and the asymmetric priors over θ_s (the same holds for PLTM and mlhPLTM). Since we perform a relative comparison between oLDA and mlhLDA, we did not experiment with different concentration parameters but rather used the default implemented in oLDA.

With a random subset of 10k training and 1k held-out articles we compared the performance of oLDA and mlhLDA with the original batch VB implementation of Blei et al. (2003). Unlike the implementations of oLDA and mlhLDA, which are written in Python, the original VB implementation is written in C and requires multiple iterations over the whole collection. The right side of Figure 5 shows the speed (in natural log scale) vs. perplexity comparison across the three models.
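Held-out perplexity, the intrinsic measure used throughout these comparisons, is the exponentiated negative average per-token log-likelihood. A minimal sketch (our own helper, not tied to any particular implementation):

```python
import math

def perplexity(doc_log_likelihoods, doc_token_counts):
    """exp(-sum_d log p(w_d) / total number of tokens); lower is better."""
    return math.exp(-sum(doc_log_likelihoods) / sum(doc_token_counts))
```

For intuition: a 4-token document whose every token has probability 1/4 yields a perplexity of exactly 4, the effective vocabulary size the model is "choosing" among.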

Modeling and Retrieving Speeches in Europarl Sessions
We compared the modeling performance of oPLTM and mlhPLTM on a subset of the English-Spanish Europarl collection (Koehn, 2005). The subset consists of ∼64k training pairs of English-Spanish speeches that are translations of each other, originating from 374 sessions of the European Parliament (Europarl), and a test set of ∼14k speech translation pairs from 112 sessions. With oPLTM we modeled individual speech pairs, while with mlhPLTM we utilized the session hierarchy and modeled pairs of speeches as document sections. Comparisons were performed intrinsically (using perplexity) and extrinsically on a cross-language information retrieval (CLIR) task. This task, along with the Europarl subset, has been previously defined and used in other publications (Platt et al., 2010; Krstovski and Smith, 2013). Given a query English speech, the CLIR task is to retrieve its Spanish translation equivalent. It involves comparing the query's topic representation to those of all Spanish speeches using Jensen-Shannon divergence and sorting the results. Models are evaluated using precision at rank one (P@1). Figure 6 shows the CLIR task performance comparison results using 13 different topic configurations. We performed comparisons across three different settings of the concentration parameters α_d and α_s (α_d = α_s = 1/K, 0.4, and 1.0). Across the different concentration parameter values and the 13 topic configurations, we observe that the performance of oPLTM fluctuates as the number of topics increases. On the other hand, across the three concentration parameter settings, mlhPLTM performance is very steady and tends to increase with the number of topics. Both models achieve their best performance with α_d = α_s = 0.4; setting the concentration parameters to 1/K gives the overall worst performance.
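The retrieval step can be sketched as follows: each Spanish speech's topic vector is scored against the English query's vector with Jensen-Shannon divergence, and candidates are sorted by increasing divergence (an illustrative sketch with our own function names):

```python
import math

def _kl(p, q):
    """Kullback-Leibler divergence, skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Symmetric, bounded divergence between two topic distributions."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def rank_candidates(query_theta, candidate_thetas):
    """Indices of candidates sorted by increasing JS divergence;
    P@1 asks whether the true translation sits at position 0."""
    return sorted(range(len(candidate_thetas)),
                  key=lambda i: js_divergence(query_theta, candidate_thetas[i]))
```

JS divergence is used here (rather than KL) because it is symmetric and remains finite even when a topic has zero mass in one of the two distributions.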
In our initial experiments we unintentionally reordered our set of training Europarl sessions based on two-digit years, which differed from the experimental setup of prior work (Krstovski and Smith, 2013), where the order of presentation of the data (Europarl speeches) was chronological. This highlighted the fact that in online VB the order of presentation of documents plays an important role, especially in the training step, where the model learns the per-topic word distributions. Figure 7 shows the performance comparison results between oPLTM and mlhPLTM when documents in the training and test steps are ordered numerically rather than chronologically. In this initial experimental setup the concentration parameters were set to α_d = α_s = 1/K. On the left is the perplexity comparison between the two models; the CLIR task performance comparison results are shown on the right. The unordered mlhPLTM achieves high P@1 only after 2,000 topics; while it takes much longer in terms of the number of topics, it ultimately achieves performance similar to that of the ordered mlhPLTM. Because documents were presented out of chronological order, performance is lower overall, especially for oPLTM.

Conclusion
We presented online topic models with a multi-level Dirichlet prior structure that provide better modeling of topical variations across document sections in mono- and multilingual collections. We showed that documents with rich sub-document structure can be modeled with higher likelihood than with regular online LDA and PLTM models while retaining the same efficiency. Furthermore, on the task of retrieving document translations, we showed that mlhPLTM achieves significantly better retrieval results than online PLTM.