A Multilingual Topic Model for Learning Weighted Topic Links Across Corpora with Low Comparability

Multilingual topic models (MTMs) learn topics on documents in multiple languages. Past models align topics across languages by implicitly assuming the documents in different languages are highly comparable, often a false assumption. We introduce a new model that does not rely on this assumption, particularly useful in important low-resource language scenarios. Our MTM learns weighted topic links and connects cross-lingual topics only when the dominant words defining them are similar, outperforming LDA and previous MTMs in classification tasks using documents’ topic posteriors as features. It also learns coherent topics on documents with low comparability.

* Now at Facebook † Now at Google AI Zürich

Figure 1: Topic pairs with many word translation pairs have high link weights, e.g., (EN-1,  and (EN-2, ZH-4); topic pairs with partial overlap receive lower weights, e.g., (EN-4, ZH-1); a topic is unlinked if there is no corresponding topic in the other language (ZH-2).

Prior models work well because they implicitly assume (even if it is not part of the model) parallel or highly comparable data with well-aligned topics. However, this assumption does not always comport with reality. Even documents from the same place and time can discuss very different things across languages: in multicultural London, Hindi tweets focus on a Bollywood actor's BBC appearance, French blogs fret about Brexit, and English articles focus on Tottenham's lineup. Generally, corpora have a range of "nonparallelness" (Fung, 2000). In less comparable settings, while some topics are shared, languages' emphasis may diverge and some topics may lack analogs. We therefore introduce a new multilingual topic model that assumes each language has its own topic set and jointly learns all topics, but does not force one-to-one alignment across languages. Instead, our MTM learns weighted topic links across languages and only assigns a high link weight to a topic pair whose top words have many direct translation pairs (Figure 1). Moreover, it allows unlinked topics if there is no matching topic in the other language. This makes the model robust for (more common) less-comparable data with topic misalignment. Joint inference also allows insights from high-resource languages to uncover low-resource language patterns. The model is particularly useful for modeling topics in low-resource languages in humanitarian assistance, peacekeeping, and/or infectious disease response, while limiting the additional cost to steps that will need to be taken anyway, such as finding or creating a word translation dictionary.
We validate the MTM in two classification tasks using inferred topic posteriors as features. Our MTM has higher F1 than other models in both intra- and cross-lingual evaluations, while discovering coherent topics and meaningful topic links.

Multilingual Topic Model for Connecting Cross-Lingual Topics
Yang et al. (2015) present a flexible framework for adding regularization to topic models. We extend this model to the multilingual setting by adding a potential function that links topics across languages. For simplicity of exposition, we focus on the bilingual case with languages $S$ and $T$. Unlike Yang et al. (2015), who encode monolingual information only, our potential function encodes multilingual knowledge parameterized by two matrices, $\rho_{S \to T}$ and $\rho_{T \to S}$, that transform topics between the two languages. Each cell's value is between 0 and 1, and a cell $\rho_{S \to T, k_T, k_S}$ close to one indicates a strong connection between topic $k_T$ in language $T$ and topic $k_S$ in language $S$. The transformations $\rho$ are learned from translation pairs' topic distributions.
These topic distributions come from the assignments of Gibbs sampling (Griffiths and Steyvers, 2004). Fortunately, adding the potential function is equivalent to adding an additional term to Gibbs sampling for topic models (Yang et al., 2015). During sampling, each token is assigned to a topic, so we can compute a post hoc word distribution over topics. The probability of a topic $k$ given a word $w$ is $\Pr(k \mid w) \equiv \Omega_{w,k} \equiv N_{k,w} / N_w$, where $N_{k,w}$ is the number of times that word $w$ is assigned to topic $k$ and $N_w$ is $w$'s term frequency.
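The post hoc word distribution over topics can be sketched in a few lines. This is a minimal illustration, assuming the sampler's state is available as a list of (word, topic) pairs; the function name `word_topic_distributions` and the toy tokens are hypothetical and not from the paper.

```python
from collections import Counter

def word_topic_distributions(assignments, num_topics):
    """Return omega[w][k] = N_{k,w} / N_w from (word, topic) token assignments."""
    pair_counts = Counter(assignments)                 # N_{k,w}, keyed by (word, topic)
    word_counts = Counter(w for w, _ in assignments)   # N_w, the term frequency
    omega = {}
    for word, n_w in word_counts.items():
        # A missing (word, topic) key counts as zero assignments.
        omega[word] = [pair_counts[(word, k)] / n_w for k in range(num_topics)]
    return omega

# Toy sampler state (hypothetical): "film" was sampled three times, "goal" twice.
tokens = [("film", 0), ("film", 0), ("film", 2), ("goal", 1), ("goal", 1)]
omega = word_topic_distributions(tokens, num_topics=3)
```

Each row of `omega` sums to one, so it can be compared directly across a translation pair.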
To find good topic links $\rho_{S \to T}$, we use a dictionary. For instance, the words in the translation pair "sports" and "运动 (yùn dòng)" should have similar topic distributions, so we want $\rho_{EN \to ZH}\,\Omega_{\text{sports}}$ to be close to $\Omega_{\text{运动}}$ and vice versa. Moreover, the transformations should be symmetric: $\rho_{S \to T}\,\Omega_{w_S}$ should be close to $\Omega_{w_T}$, and vice versa. We encode this cross-lingual knowledge of topic transformations into the potential function $\Psi$, which measures the difference of translation pairs' topic distributions after transformation and weights the $c$-th translation pair by $\eta_c$, its statistical importance to the corpus (Figure 2, full details in the Supplement).
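To make the idea of "difference after transformation" concrete, here is a hedged sketch. The exact form of $\Psi$ is not reproduced in this text, so the squared Euclidean distance, the additive combination of both directions, and the names `weighted_discrepancy`, `pairs`, and `eta` are all assumptions made for illustration only.

```python
def mat_vec(matrix, vector):
    """Multiply a (K_target x K_source) matrix by a length-K_source vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def weighted_discrepancy(rho_s_to_t, rho_t_to_s, pairs, eta):
    """Assumed measure: sum over pairs of eta_c times the two-way squared error.

    pairs: list of (omega_s, omega_t) topic distributions for each translation pair.
    eta:   one importance weight per translation pair.
    """
    total = 0.0
    for (omega_s, omega_t), weight in zip(pairs, eta):
        forward = squared_distance(mat_vec(rho_s_to_t, omega_s), omega_t)
        backward = squared_distance(mat_vec(rho_t_to_s, omega_t), omega_s)
        total += weight * (forward + backward)
    return total

# Toy case: source topic 0 corresponds to target topic 1 and vice versa.
swap = [[0.0, 1.0], [1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
pairs = [([1.0, 0.0], [0.0, 1.0])]
aligned = weighted_discrepancy(swap, swap, pairs, eta=[2.0])
misaligned = weighted_discrepancy(identity, identity, pairs, eta=[2.0])
```

Under these assumptions, link matrices that map each word's topic distribution onto its translation's distribution give a discrepancy of zero, and a larger `eta` makes a mismatched pair cost more.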
While Yang et al. (2015) provide a blueprint for Gibbs sampling with potential functions without additional parameters, our model has the additional parameters $\rho_{S \to T}$ and $\rho_{T \to S}$, so we need to optimize them. Thus, we use stochastic EM (Celeux, 1985). The E-step updates tokens' topic assignments using Gibbs sampling, while holding the topic link weight matrices $\rho$ fixed. The M-step optimizes $\rho$ while holding the topic assignments fixed. We optimize $\Psi$ in log space using an objective function $J(\rho_{S \to T})$, which is minimized using L-BFGS (Liu and Nocedal, 1989), with the partial derivatives with re-

Experiments
We evaluate our model extrinsically on classification tasks, followed by intrinsic topic coherence.

Classification with Topic Posteriors
We use two datasets for classification: Wikipedia documents in English (EN) and Chinese (ZH) (Yuan et al., 2018) and an English-Sinhalese (SI) disaster response dataset (Strassel and Tracey, 2016). Each Wikipedia document is labeled with one of the topics of film, music, animals, politics, religion, and food. A portion of the disaster response documents are labeled with one of eight types of needed rescue resources: evacuation, food supply, search/rescue, utilities, infrastructure, medical assistance, shelter, and water supply. We follow Yuan et al. (2018) for preprocessing (such as lemmatization for English and segmentation for Chinese) and use a linear SVM for classification. For the Wikipedia dataset, we report micro-F1 scores on a six-way classification. For the disaster response dataset, our goal is binary classification of the need for evacuation versus other assistance. The classification uses topic posteriors as features: $\Pr(k \mid d) \equiv N_{d,k} / N_d$, the proportion of the tokens in document $d$ that are assigned to topic $k$.
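The feature vector for one document follows directly from the definition above. This is a minimal sketch, assuming a document is represented by the list of topic indices assigned to its tokens; the function name `topic_posterior` is hypothetical.

```python
from collections import Counter

def topic_posterior(doc_topic_assignments, num_topics):
    """Return [N_{d,k} / N_d for each topic k] for a single document."""
    n_d = len(doc_topic_assignments)
    if n_d == 0:
        # Assumption: an empty document gets an all-zero feature vector.
        return [0.0] * num_topics
    counts = Counter(doc_topic_assignments)
    return [counts[k] / n_d for k in range(num_topics)]

# A four-token document whose tokens were assigned to topics 0, 0, 3, and 1.
features = topic_posterior([0, 0, 3, 1], num_topics=4)
```

Vectors like `features` are what a linear classifier would consume, one row per document.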
Our evaluations are both intra- and cross-lingual. The intra-lingual evaluation trains and tests classifiers on the same language, while the cross-lingual evaluation trains classifiers on one language and tests on another. In cross-lingual evaluations, MTAnchor, MCTA, and ptLDA align topic spaces, so topic posterior transformation is not necessary. LDA cannot transform topic spaces, so we do not apply any transformation. For our MTM, we explore two transformation methods with $\rho$. The first multiplies $\rho$ with a language's document topic distributions, i.e., $\rho_{ZH \to EN}\,\theta_{ZH}$ and vice versa. The second (TOP) transfers each document topic's probability mass to the topic in the other language with the highest link weight. Our MTM has higher F1 both intra- and cross-lingually (Figure 3). TF-IDF weighting on translation pairs sometimes improves the intra-lingual F1, although it hurts the cross-lingual F1. Connecting the top linked topics (TOP) is better than directly using the topic link weight matrices. This indicates that $\rho$'s values have some noise.
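The two transformation methods can be contrasted on a toy example. This is a sketch under stated assumptions: the link matrix is stored with one row per target topic and one column per source topic (matching the cell indexing described earlier), ties in TOP go to the lowest target index, and the function names and numbers are hypothetical.

```python
def transform_by_matrix(rho, theta_src):
    """First method: multiply the link matrix with the document topic distribution."""
    return [sum(r * t for r, t in zip(row, theta_src)) for row in rho]

def transform_by_top_link(rho, theta_src):
    """Second method (TOP): move each source topic's mass to its best-linked target topic."""
    num_tgt = len(rho)
    theta_tgt = [0.0] * num_tgt
    for k_src, mass in enumerate(theta_src):
        best_tgt = max(range(num_tgt), key=lambda k_tgt: rho[k_tgt][k_src])
        theta_tgt[best_tgt] += mass
    return theta_tgt

# Hypothetical link weights from 2 ZH topics (columns) to 3 EN topics (rows).
rho_zh_to_en = [
    [0.9, 0.1],
    [0.2, 0.3],
    [0.0, 0.8],
]
theta_zh = [0.75, 0.25]

by_matrix = transform_by_matrix(rho_zh_to_en, theta_zh)
by_top = transform_by_top_link(rho_zh_to_en, theta_zh)
```

In this toy case the matrix product spreads some mass onto the weakly linked middle EN topic, while TOP leaves it at zero, which illustrates why TOP may be less sensitive to noisy small link weights.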

Looking at Learned Topics
Past MTMs align topics across languages but our MTM does not, so we compare the topics across models to see how they differ. We look at the Movies topics from the Wikipedia dataset (Table 1). For the Chinese MTM topics, we show the three English topics with the highest link weights.
The topics are about Movies, but the MCTA and MTAnchor topics do not rank "movie" or "电影 (diàn yǐng)" at the top. The ptLDA topics, although aligned well, incorrectly align some Chinese words. "胶片 (jiāo piàn)" means "photographic film", while "释放 (shì fàng)" means "release" as in "let something go", not movie distribution. ptLDA links words based on translations without looking at the context, which causes problems with words that have multiple senses. The LDA and MTM topics are generally coherent.
The MTM's unique joint modeling of weighted topic links also recovers additional topical structure: after linking respective EN-ZH Movies topics, the next linked topics are Action Movies ("kill", "death", "attack", and "escape"). Further, the models capture a degree of connection between Movies and Computer Games (MTM + TF-IDF) and Japanese Animations (MTM).

Evaluating Topic Coherence
We intrinsically evaluate models' topic coherence on two sets of preprocessed bilingual Wikipedia corpora (Hao and Paul, 2018) that vary in "non-parallelness". Both pair English with Arabic, Chinese, Spanish, Farsi, and Russian. In PACO, 30% of documents have direct translations across languages, and in INCO none have direct translations. Dictionaries are extracted from Wiktionary. Standard preprocessing has already been applied to the datasets, including stemming, stop word removal, and high-frequency word removal.
We use an intra-lingual metric to evaluate topic coherence (Lau et al., 2014): for every topic, we compute its top N words' average pairwise PMI score on a disjoint subset of Wikipedia documents (Hao and Paul, 2018). We report average coherence with N from 10 to 100 with a step size of 10 (five-fold cross-validation). We use the same translation pair weighting options as in our classification tasks and also compare against monolingual LDA and ptLDA (Hu et al., 2014).
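The coherence metric can be sketched as follows. This is a simplified illustration, not the evaluation code used in the paper: it assumes document-level co-occurrence on the reference corpus, natural logarithms, and a small `epsilon` for smoothing, and it needs at least two top words; all of those choices and the function name are assumptions.

```python
import math
from itertools import combinations

def average_pairwise_pmi(top_words, reference_docs, epsilon=1e-12):
    """Average PMI over all pairs of a topic's top words (document co-occurrence)."""
    doc_sets = [set(doc) for doc in reference_docs]
    num_docs = len(doc_sets)

    def prob(*words):
        # Fraction of reference documents that contain every given word.
        return sum(all(w in d for w in words) for d in doc_sets) / num_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        joint = prob(w1, w2)
        scores.append(math.log((joint + epsilon) / (prob(w1) * prob(w2) + epsilon)))
    return sum(scores) / len(scores)

# Toy reference corpus (hypothetical): two film documents, two sports documents.
reference = [["film", "actor"], ["film", "actor"], ["goal", "team"], ["goal", "team"]]
coherent = average_pairwise_pmi(["film", "actor"], reference)
incoherent = average_pairwise_pmi(["film", "goal"], reference)
```

Words that always co-occur score above zero, while words that never co-occur score strongly negative, so a topic mixing unrelated words receives a lower average.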
MTM is no worse than LDA and sometimes slightly better (Figure 4). TF-IDF weighting on translation pairs sometimes further improves coherence (e.g., Arabic, Farsi, Russian, and Spanish on INCO) but occasionally hurts (e.g., Chinese). ptLDA mostly works poorly, except on Farsi with a high number of top words. ptLDA aligns topic spaces, which is hard for low-comparability data, thus sacrificing coherence for alignment; in contrast, MTM only connects topics when they align well in sense.

Topic Coherence vs. Target Language Corpora Sizes
We next vary the size of target language (non-English languages in PACO and INCO) corpora: how much can MTM help topic coherence for low-resource languages? We start from 10% of the randomly-selected documents in target languages and incrementally add more target language documents at a step size of 10% until it reaches 100%. Meanwhile, we always use 100% of the English documents. We train monolingual LDA, ptLDA, and MTMs with and without TF-IDF weighting on translation pairs on each setting and evaluate the topic coherence on the same reference corpora using the top thirty words of each topic (Figure 5). In most cases, the topic coherence improves with larger target corpora, except Arabic and Russian on PACO. This confirms our intuition that more data yield a better topic model. MTM is helpful in cases when the target language corpora sizes are small, e.g., Chinese and Russian with 10% or 20% of the corpora. TF-IDF weighting is not consistently better or worse than equal weights. The ptLDA with tree priors based on dictionaries performs poorly in topic coherence, except Farsi in INCO. In most cases, its topic coherence is substantially below others' and improves little when the target corpora grow.

Figure 5: The models' topic coherence on INCO and PACO datasets when the sizes of target language corpora grow from 10% to 100%, with a step size of 10%.
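The corpus-size schedule described above can be sketched as nested random subsets. This is an illustration under assumptions: that each larger setting contains all documents of the smaller ones (suggested by "incrementally add"), that a fixed random seed is used, and that sizes are rounded to whole documents; the function name `nested_subsets` is hypothetical.

```python
import random

def nested_subsets(documents, num_steps=10, seed=0):
    """Return num_steps growing prefixes of one random shuffle (10%, 20%, ..., 100%)."""
    rng = random.Random(seed)
    shuffled = list(documents)
    rng.shuffle(shuffled)
    subsets = []
    for i in range(1, num_steps + 1):
        cutoff = round(len(shuffled) * i / num_steps)
        subsets.append(shuffled[:cutoff])
    return subsets

# Toy target-language corpus of 50 document identifiers.
docs = [f"doc{i}" for i in range(50)]
subsets = nested_subsets(docs)
```

Each entry of `subsets` would be paired with the full English corpus to train one model per setting.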