Lessons from the Bible on Modern Topics: Low-Resource Multilingual Topic Model Evaluation

Multilingual topic models enable document analysis across languages through coherent multilingual summaries of the data. However, there is no standard and effective metric to evaluate the quality of multilingual topics. We introduce a new intrinsic evaluation of multilingual topic models that correlates well with human judgments of multilingual topic coherence as well as performance in downstream applications. Importantly, we also study evaluation for low-resource languages. Because standard metrics fail to accurately measure topic quality when robust external resources are unavailable, we propose an adaptation model that improves the accuracy and reliability of these metrics in low-resource settings.


Introduction
Topic models provide a high-level view of the main themes of a document collection (Boyd-Graber et al., 2017). Document collections, however, are often not in a single language, driving the development of multilingual topic models. These models discover topics that are consistent across languages, providing useful tools for multilingual text analysis (Vulić et al., 2015), such as detecting cultural differences (Gutiérrez et al., 2016) and bilingual dictionary extraction (Liu et al., 2015).
Monolingual topic models can be evaluated through likelihood (Wallach et al., 2009b) or coherence (Newman et al., 2010), but topic model evaluation is not well understood in multilingual settings. Our contributions are two-fold. We introduce an improved intrinsic evaluation metric for multilingual topic models, called Crosslingual Normalized Pointwise Mutual Information (cnpmi, Section 2). We explore the behaviors of cnpmi at both the model and topic levels with six language pairs and varying model specifications. This metric correlates well with human judgments and crosslingual classification results (Sections 5 and 6).
We also focus on evaluation in low-resource languages, which lack large parallel corpora, dictionaries, and other tools that are often used in learning and evaluating topic models. To adapt cnpmi to these settings, we create a coherence estimator (Section 3) that extrapolates statistics derived from antiquated, specialized texts like the Bible: often the only resource available for many languages.

Evaluating Multilingual Coherence
A multilingual topic contains one topic for each language. For a multilingual topic to be meaningful to humans (Figure 1), the meanings should be consistent across the languages, in addition to being coherent within each language (i.e., all words in a topic are related). This section describes our approach to evaluating the quality of multilingual topics. After defining the multilingual topic model, we describe topic model evaluation, extending standard monolingual approaches to multilingual settings.

Multilingual Topic Modeling
Probabilistic topic models associate each document in a corpus with a distribution over latent topics, while each topic is associated with a distribution over words in the vocabulary. The most widely used topic model, latent Dirichlet allocation (Blei et al., 2003, lda), can be extended to connect languages. These extensions require additional knowledge to link languages together.
One common encoding of multilingual knowledge is document links (indicators that documents are parallel or comparable), used in polylingual topic models (Mimno et al., 2009; Ni et al., 2009). In these models, each document d indexes a tuple of parallel/comparable language-specific documents, $d^{(\ell)}$, and the language-specific "views" of a document share the document-topic distribution $\theta_d$. The generative story for the document-links model is:
1. For each topic k and each language $\ell$, draw a topic-word distribution $\phi_k^{\ell} \sim \mathrm{Dirichlet}(\beta)$.
2. For each document tuple d, draw a document-topic distribution $\theta_d \sim \mathrm{Dirichlet}(\alpha)$.
3. For each language $\ell$ and each token position in $d^{(\ell)}$, draw a topic $z \sim \theta_d$ and then a word $w \sim \phi_z^{\ell}$.
Alternatively, word translations (Jagarlamudi and Daumé III, 2010), concept links (Gutiérrez et al., 2016; Yang et al., 2017), and multi-level priors (Krstovski et al., 2016) can also provide multilingual knowledge. Since the polylingual topic model is the most common approach for building multilingual topic models (Vulić et al., 2013, 2015; Liu et al., 2015; Krstovski and Smith, 2016), our study focuses on this model.
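As a rough illustration of the document-links idea, the following Python sketch samples one tuple of linked documents whose language-specific views share a single document-topic distribution. It is a toy sketch under stated assumptions, not the MALLET implementation used in the experiments: the function names (`sample_dirichlet`, `generate_tuple`), the two tiny vocabularies, and the document length are hypothetical choices made only for illustration.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    if total == 0.0:  # every draw underflowed; fall back to uniform
        return [1.0 / size] * size
    return [x / total for x in draws]

def generate_tuple(vocabs, phi, alpha, doc_len, rng):
    """Generate one tuple of linked documents, one document per language.

    vocabs: dict mapping language -> list of word types
    phi:    dict mapping language -> list of K topic-word distributions
    """
    num_topics = len(next(iter(phi.values())))
    # The document-topic distribution is drawn once and shared by all languages.
    theta = sample_dirichlet(alpha, num_topics, rng)
    docs = {}
    for lang, vocab in vocabs.items():
        tokens = []
        for _ in range(doc_len):
            z = rng.choices(range(num_topics), weights=theta)[0]
            tokens.append(rng.choices(vocab, weights=phi[lang][z])[0])
        docs[lang] = tokens
    return theta, docs

# Toy setup (hypothetical vocabularies, two topics).
rng = random.Random(0)
vocabs = {
    "en": ["press", "free", "art", "music"],
    "sv": ["press", "fri", "konst", "musik"],
}
K = 2
phi = {
    lang: [sample_dirichlet(0.01, len(vocab), rng) for _ in range(K)]
    for lang, vocab in vocabs.items()
}
theta, docs = generate_tuple(vocabs, phi, alpha=0.1, doc_len=8, rng=rng)
```

Note that nothing in this sketch ties topic k in one language to topic k in the other except the shared draw of theta; in the model, that sharing is what aligns topics across languages during inference.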

Monolingual Evaluation
Most automatic topic model evaluation metrics use co-occurrence statistics of word pairs from a reference corpus to evaluate topic coherence, assuming that coherent topics contain words that often appear together (Newman et al., 2010). The most successful (Lau et al., 2014) is normalized pointwise mutual information (Bouma, 2009, npmi). npmi compares the joint probability of words appearing together, $\Pr(w_i, w_j)$, to their probability assuming independence, $\Pr(w_i)\Pr(w_j)$, normalized by the joint probability:
$$\mathrm{npmi}(w_i, w_j) = \frac{\log \frac{\Pr(w_i, w_j)}{\Pr(w_i)\Pr(w_j)}}{-\log \Pr(w_i, w_j)}. \quad (1)$$
The word probabilities are calculated from a reference corpus, R, typically a large corpus such as Wikipedia that can provide meaningful co-occurrence patterns that are independent of the target dataset. The quality of topic k is the average npmi of all word pairs $(w_i, w_j)$ in the topic:
$$\mathrm{npmi}_k = \frac{1}{\binom{C}{2}} \sum_{\{w_i, w_j\} \subset W(k, C),\ i < j} \mathrm{npmi}(w_i, w_j),$$
where $W(k, C)$ are the C most probable words in the topic-word distribution $\phi_k$ (the number of words is the topic's cardinality). Higher $\mathrm{npmi}_k$ means the topic's top words are more coupled.
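The computation can be sketched in a few lines of Python. This is a minimal illustration, assuming whole documents are the co-occurrence unit (sliding windows are another common choice) and assuming a score of zero for pairs that never co-occur; the function names and the four toy documents are hypothetical.

```python
import math
from itertools import combinations

def npmi(w1, w2, docs):
    """NPMI of two words from document-level co-occurrence.

    docs is a list of sets of word types, one set per reference document.
    """
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    if p12 == 0.0:
        return 0.0  # assumed convention: no evidence yields a score of zero
    if p12 == 1.0:
        return 1.0  # the words co-occur everywhere; avoids dividing by log(1) = 0
    return math.log(p12 / (p1 * p2)) / -math.log(p12)

def topic_npmi(top_words, docs):
    """Average NPMI over all unordered pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, docs) for a, b in pairs) / len(pairs)

# Toy reference corpus (hypothetical).
docs = [
    {"press", "free", "newspaper"},
    {"press", "newspaper"},
    {"art", "music"},
    {"art", "music", "free"},
]
score_pair = npmi("press", "newspaper", docs)                # always co-occur
score_topic = topic_npmi(["press", "newspaper", "free"], docs)
```

In this toy corpus "press" and "newspaper" appear in exactly the same documents, so their pair score reaches the maximum, while adding the less related word "free" pulls the topic average down.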

Existing Multilingual Evaluations
While automatic evaluation has been well-studied for monolingual topic models, there are no robust evaluations for multilingual topic models. We first consider two straightforward metrics that could be used for multilingual evaluation, both with limitations. We then propose an extension of npmi that addresses these limitations.
Internal Coherence. A simple adaptation of npmi is to calculate the monolingual npmi score for each language independently and take the average. We refer to this as internal npmi (inpmi) as it evaluates coherence within a language. However, this metric does not consider whether the topic is coherent across languages, that is, whether a language-specific word distribution $\phi_k^{\ell_1}$ is related to the corresponding distribution in another language, $\phi_k^{\ell_2}$.

Crosslingual Consistency. Another straightforward measurement is Matching Translation Accuracy (Boyd-Graber and Blei, 2009, mta), which counts the number of word translations in a topic between two languages using a bilingual dictionary. This metric can measure whether a topic is well-aligned across languages literally, but cannot capture non-literal, more holistic similarities across languages.
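One plausible reading of this counting procedure is sketched below in Python. The exact normalization of mta is not specified here, so the sketch simply counts translation pairs; the function name `matching_translations` and the toy dictionary are assumptions for illustration only.

```python
def matching_translations(words_l1, words_l2, dictionary):
    """Count bilingual pairs (w1, w2) among the top words that the dictionary links.

    dictionary maps a language-1 word to a set of language-2 translations.
    """
    targets = set(words_l2)
    return sum(
        1
        for w1 in words_l1
        for w2 in dictionary.get(w1, ())
        if w2 in targets
    )

# Toy English-Swedish example (hypothetical dictionary entries).
dictionary = {
    "press": {"press"},
    "free": {"fri", "gratis"},
    "art": {"konst"},
}
top_en = ["press", "free", "newspaper"]
top_sv = ["press", "fri", "konst"]
mta_count = matching_translations(top_en, top_sv, dictionary)
```

Here "newspaper" has no dictionary entry and "konst" has no English counterpart among the top words, so only two pairs are counted even though all six words could plausibly belong to related themes.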

New Metric: Crosslingual npmi
We extend npmi to multilingual models, with a metric we call crosslingual normalized pointwise mutual information (cnpmi). This metric will be the focus of our experiments. A multilingually coherent topic means that if $w_{i,\ell_1}$ in language $\ell_1$ and $w_{j,\ell_2}$ in language $\ell_2$ are in the same topic, they should appear in similar contexts in comparable or parallel corpora $R^{(\ell_1,\ell_2)}$. Our adaptation of npmi is based on the same principles as the monolingual version, but focuses on the co-occurrences of bilingual word pairs. Given a bilingual word pair $(w_{i,\ell_1}, w_{j,\ell_2})$, the co-occurrence of this word pair is the event where word $w_{i,\ell_1}$ appears in a document in language $\ell_1$ and the word $w_{j,\ell_2}$ appears in a comparable or parallel document in language $\ell_2$.
The co-occurrence probability of each bilingual word pair is:
$$\Pr(w_{i,\ell_1}, w_{j,\ell_2}) = \frac{\left|\{ d : w_{i,\ell_1} \in d^{(\ell_1)},\ w_{j,\ell_2} \in d^{(\ell_2)} \}\right|}{\left| R^{(\ell_1,\ell_2)} \right|},$$
where $d = (d^{(\ell_1)}, d^{(\ell_2)})$ is a pair of parallel/comparable documents in the reference corpus $R^{(\ell_1,\ell_2)}$. When one or both words in a bilingual pair do not appear in the reference corpus, the co-occurrence score is zero. Similar to monolingual settings, cnpmi for a bilingual topic k is the average of the npmi scores of all $C^2$ bilingual word pairs,
$$\mathrm{cnpmi}(\ell_1, \ell_2, k) = \frac{1}{C^2} \sum_{i=1}^{C} \sum_{j=1}^{C} \mathrm{npmi}(w_{i,\ell_1}, w_{j,\ell_2}).$$
It is straightforward to generalize cnpmi from a language pair to multiple languages by averaging $\mathrm{cnpmi}(\ell_i, \ell_j, k)$ over all language pairs $(\ell_i, \ell_j)$.

Adapting to Low-Resource Languages
cnpmi needs a reference corpus for co-occurrence statistics. Wikipedia, which has good coverage of topics and vocabularies, is a common choice (Lau and Baldwin, 2016). Unfortunately, Wikipedia is often unavailable or not large enough for low-resource languages. It covers only 282 languages, and only 249 languages have more than 1,000 pages; many of these pages are short or unlinked to a high-resource language. Since cnpmi requires comparable documents, the usable reference corpus is defined by paired documents.
Another option for a parallel reference corpus is the Bible (Resnik et al., 1999), which is available in most world languages; however, it is small and archaic. It is good at evaluating topics such as family and religion, but not "modern" topics like biology and the Internet. Without reference co-occurrence statistics relevant to these topics, cnpmi will fail to judge topic coherence: it must give the ambiguous answer of zero. Such a score could mean a totally incoherent topic where each word pair never appears together (Topic 6 in Figure 1), or an unjudgeable topic (Topic 5).
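The ambiguity of a zero score is easy to reproduce in a small Python sketch of cnpmi over paired documents. The sketch assumes document-level co-occurrence and treats both unseen words and never co-occurring pairs as zero; the function names and the four-pair toy reference corpus are hypothetical and only stand in for a Bible-like resource.

```python
import math

def bilingual_npmi(w1, w2, paired_docs):
    """NPMI of a bilingual word pair over (language-1 set, language-2 set) pairs."""
    n = len(paired_docs)
    c1 = sum(w1 in d1 for d1, _ in paired_docs)
    c2 = sum(w2 in d2 for _, d2 in paired_docs)
    c12 = sum(w1 in d1 and w2 in d2 for d1, d2 in paired_docs)
    if c12 == 0:
        return 0.0  # covers unseen words and pairs that never co-occur
    if c12 == n:
        return 1.0  # co-occur in every pair; avoids dividing by log(1) = 0
    p1, p2, p12 = c1 / n, c2 / n, c12 / n
    return math.log(p12 / (p1 * p2)) / -math.log(p12)

def cnpmi(words_l1, words_l2, paired_docs):
    """Average bilingual NPMI over all pairs of top words from the two languages."""
    scores = [bilingual_npmi(a, b, paired_docs) for a in words_l1 for b in words_l2]
    return sum(scores) / len(scores)

# Toy parallel reference corpus (hypothetical English-Swedish chapter pairs).
reference = [
    ({"father", "son"}, {"fader", "son"}),
    ({"father", "lord"}, {"fader", "herre"}),
    ({"lord", "king"}, {"herre", "kung"}),
    ({"king", "son"}, {"kung", "son"}),
]
aligned = cnpmi(["father"], ["fader"], reference)                       # well covered
unjudgeable = cnpmi(["web", "internet"], ["webb", "internet"], reference)
incoherent = cnpmi(["father"], ["kung"], reference)
```

Both `unjudgeable` and `incoherent` come out as zero, although the first reflects missing vocabulary and the second reflects words that are present but never appear in paired documents. The score alone cannot tell these cases apart.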
Our goal is to obtain a reliable estimation of topic coherence for low-resource languages when the Bible is the only reference. We propose a model that can correct the drawbacks of a Bible-derived cnpmi. While we assume bilingual topics paired with English, our approach can be applied to any high-resource/low-resource language pair.
We take Wikipedia's cnpmi from high-resource languages as accurate estimations. We then build a coherence estimator on topics from high-resource languages, with the Wikipedia cnpmi as the target output. We use linear regression over the features described below. Given a topic in a low-resource language, the estimator produces an estimated coherence (Figure 2).

Estimator Features
The key to the estimator is to find features that capture whether we should trust the Bible. For generality, we focus on features independent of the available resources other than the Bible. This section describes the features, which we split into four groups.
Base Features (base) Our base features include information we can collect from the Bible and the topic model: cardinality C, cnpmi and inpmi, mta, and topic word coverage (twc), which counts the percentage of topic words in a topic that appear in a reference corpus.
Crosslingual Gap (gap) A low cnpmi score could indicate a topic pair where each language has a monolingually coherent topic but the two topics are not about the same theme (Topic 6 in Figure 1). Thus, we add two features to capture this information using the Bible: mismatch coefficients (mc) and internal comparison coefficients (icc):
$$\mathrm{mc}(\ell_1, \ell_2, k) = \frac{\mathrm{cnpmi}(\ell_1, \ell_2, k)}{\mathrm{inpmi}(\ell_1, \ell_2, k) + \alpha}, \qquad \mathrm{icc}(\ell_1, \ell_2, k) = \frac{\mathrm{npmi}(\ell_1, k) + \alpha}{\mathrm{npmi}(\ell_2, k) + \alpha},$$
where α is a smoothing factor (α = 0.001 in our experiments). mc recognizes the gap between crosslingual and monolingual coherence, so a higher mc score indicates a gap between coherence within and across languages. Similarly, icc compares monolingual coherence to tell if both languages are coherent: the closer to 1 the icc is, the more comparable internal coherence both languages have.
Word Era (era) Because the Bible's vocabulary is unable to evaluate modern topics, we must tell the model what the modern words are. The word era features are the earliest usage year for each word in a topic. We use both the mean and standard deviation as features.
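A minimal Python sketch of these two features follows. The lookup table `first_use_year`, the placeholder word names, and the placeholder years are all hypothetical (they are not real dating data), and the choices to skip words without a known year and to use the population standard deviation are assumptions.

```python
import statistics

def era_features(topic_words, first_use_year):
    """Mean and standard deviation of earliest-usage years for a topic's words.

    Words missing from the lookup table are skipped (an assumption).
    Returns None when no word has a known year.
    """
    years = [first_use_year[w] for w in topic_words if w in first_use_year]
    if not years:
        return None
    return statistics.fmean(years), statistics.pstdev(years)

# Placeholder lookup table: the words and years are illustrative only.
first_use_year = {"w_old": 1000, "w_mid": 1200, "w_new": 1900}

old_topic = era_features(["w_old", "w_mid"], first_use_year)
mixed_topic = era_features(["w_old", "w_new", "w_unknown"], first_use_year)
```

A topic whose mean year is recent, or whose standard deviation is large, contains words the Bible is unlikely to cover, which is the signal the estimator can use to discount a Bible-derived score.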
Meaning Drift (drift). The meaning of a word can expand and drift over time. For example, in the Bible, "web" appears in Isaiah 59:5: "They hatch cockatrice' eggs, and weave the spider's web."
The word "web" could be evaluated correctly in an animal topic. For modern topics, however, the Bible fails to capture modern meanings of "web", as in Topic 5 (Figure 1).
To address this meaning drift, we use a method similar to Hamilton et al. (2016). For each English word, we calculate the context vector from the Bible and from Wikipedia with a window size of five and calculate the cosine similarity between them as word similarity. Similar context vectors mean that the usage in the Bible is consistent with Wikipedia. We calculate word similarities for all the English topic words in a topic and use the average and standard deviation as features.
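The following Python sketch shows one way to compute such a similarity with raw co-occurrence counts. It assumes the window covers five tokens on each side of the target word and uses unweighted counts; the original vectors may be built differently. The function names and the two one-sentence toy corpora are hypothetical.

```python
import math
import statistics
from collections import Counter

def context_vector(word, tokens, window=5):
    """Count the words within `window` tokens on either side of each occurrence."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == word:
            vec.update(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    return vec

def cosine(u, v):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def drift_features(topic_words, old_tokens, new_tokens):
    """Mean and standard deviation of per-word similarity between two corpora."""
    sims = [
        cosine(context_vector(w, old_tokens), context_vector(w, new_tokens))
        for w in topic_words
    ]
    return statistics.fmean(sims), statistics.pstdev(sims)

# Toy corpora (hypothetical snippets standing in for the Bible and Wikipedia).
bible = "and weave the spider web".split()
wiki = "open the web page in a web browser".split()
web_similarity = cosine(context_vector("web", bible), context_vector("web", wiki))
mean_sim, std_sim = drift_features(["web", "browser"], bible, wiki)
```

The only context word the two corpora share for "web" is "the", so the similarity is low, which is the pattern expected for a word whose dominant meaning has shifted.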

Example
In Figure 3, Topic 1 is coherent while Topic 8 is not. From left to right, we incrementally add new feature sets, and show how the estimated topic coherence scores (dashed lines) approach the ideal cnpmi (dotted lines). When only using the base features, the estimator gives a higher prediction to Topic 8 than to Topic 1. Their low mta and twc prevent accurate evaluations. Adding gap does not help much. However, icc(en, am, k = 1) is much smaller, which might indicate a large gap of internal coherence between the two languages.
Adding era makes the estimated scores flip between the two topics. Topic 1 has a word era of 1823, much older than Topic 8's word era of 1923, indicating that Topic 8 includes modern words the Bible lacks (e.g., "computer"). Using all the features, the estimator gives more accurate topic coherence evaluations.

Experiments: Bible to Wikipedia
We experiment on six languages (Table 1) from three corpora: Romanian (ro) and Swedish (sv) from EuroParl as representative of well-studied and rich-resource languages (Koehn, 2005); Amharic (am) and Tagalog (tl) from collected news, as low-resource languages (Huang et al., 2002a,b); and Chinese (zh) and Turkish (tr) from TED Talks 2013 (Tiedemann, 2012), adding language variety to our experiments. Each language is paired with English as a bilingual corpus.
Typical preprocessing methods (stemming, stop word removal, etc.) are often unavailable for low-resource languages. For a meaningful comparison across languages, we do not apply stemming or lemmatization to any language, including English; we only remove digits and symbols. However, we remove words that appear in more than 30% of documents for each language.
Each language pair is separately trained using the MALLET (McCallum, 2002) implementation of the polylingual topic model. Each experiment runs five Gibbs sampling chains with 1,000 iterations per chain with twenty topics. The hyperparameters are set to the default values (α = 0.1, β = 0.01), and are optimized every 50 iterations in MALLET using slice sampling (Wallach et al., 2009a).

Evaluating Multilingual Topics
We use Wikipedia and the Bible as reference corpora for calculating co-occurrence statistics. Different numbers of Wikipedia articles are available for each language pair (Table 1), while the Bible contains a complete set of 1,189 chapters for all of its translations (Christodoulopoulos and Steedman, 2015). We use Wiktionary as the dictionary to calculate mta.

Training the Estimator
In addition to experimenting on Wikipedia-based cnpmi, we also re-evaluate the topics' Bible coherence using our estimator. In the following experiments, we use an AdaBoost regressor with linear regression as the coherence estimator (Friedman, 2002; Collins et al., 2000). The estimator takes a topic and a low-quality cnpmi score as input and outputs (hopefully) an improved cnpmi score.
To make our testing scenario more realistic, we treat one language as our estimator's test language and train on multilingual topics from the other languages. We use three-fold cross-validation over languages to select the best hyperparameters, including the learning rate and loss function in AdaBoost.R2 (Drucker, 1997).
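The data split behind this protocol can be sketched in Python as follows. Only the splitting logic is shown; fitting the regressor is left out. The function names, the round-robin assignment of training languages to folds, and the placeholder topic identifiers are assumptions for illustration.

```python
def leave_one_language_out(topics_by_language):
    """Yield (test_language, training_topics, test_topics) for each language."""
    for test_lang, test_topics in topics_by_language.items():
        train = [
            (lang, topic)
            for lang, topics in topics_by_language.items()
            if lang != test_lang
            for topic in topics
        ]
        yield test_lang, train, test_topics

def language_folds(languages, k=3):
    """Partition the training languages into k folds (round-robin assignment)."""
    return [languages[i::k] for i in range(k)]

# Placeholder topics: two topic identifiers per language pair.
topics_by_language = {
    lang: [f"{lang}-topic-{i}" for i in range(2)]
    for lang in ["ro", "sv", "am", "tl", "zh", "tr"]
}
splits = list(leave_one_language_out(topics_by_language))

# Inner folds over the five training languages when Turkish is held out.
train_langs = [lang for lang in topics_by_language if lang != "tr"]
folds = language_folds(train_langs, k=3)
```

Splitting by language rather than by topic keeps every topic of the test language unseen during training, which mirrors the situation of a genuinely new low-resource language.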

Topic-Level Evaluation
We first study cnpmi at the topic level: does a particular topic make sense? An effective evaluation should be consistent with human judgment of the topics (Chang et al., 2009). In this section, we measure gold-standard human interpretability of multilingual topics to establish which automatic measures of topic interpretability work best.
Figure 4: The interface for topic quality judgments. Users read the topic first, and make a judgment on whether the words in this pair are talking about the same thing. The translations are here for illustration; they are not shown to the users. [Example topic pair: English "rights, government, newspaper, country, justice, democratic"; Amharic "ፕሬስ (press), ነፃ (free), ጋዜጣ (newspaper), መብት (right), ጋዜጠኞች (journalists), ሕዝብ (people), ሥርዓት (system)". Question: "Are these two groups of words talking about the same thing?" Choices: Yes, Somewhat, No.]

Task Design
Following monolingual coherence evaluations (Lau et al., 2014), we present topic pairs to bilingual CrowdFlower users. Each task is a topic pair with the top ten topic words (C = 10) for each language. We ask if both languages' top words in a multilingual topic are talking about the same concept (Figure 4), and collect a judgment on a three-point scale: coherent (2 points), somewhat coherent (1 point), and incoherent (0 points). To ensure the users have adequate language competency, we insert several topics that are easily identifiable as incoherent as a qualification test.
We randomly select sixty topics from each language pair (360 topics total), and each topic is judged by five users. We take the average of the judgment points and calculate Pearson correlations with the proposed evaluation metrics (Table 2). npmi-based scores are separately calculated from each reference corpus.
Table 3: Correlations between the Wikipedia-based cnpmi and the Bible-based cnpmi, before and after using the coherence estimator, at the topic level. Strong correlations indicate that the estimator improves cnpmi estimates.

Agreement with Human Judgments
cnpmi (the extended metric) has higher correlations with human judgments than inpmi (the naive adaptation of monolingual npmi), while mta (matching translation accuracy) correlations are comparable to cnpmi.Unsurprisingly, when using Wikipedia as the reference, the correlations are usually higher than when using the Bible.The Bible's archaic content limits its ability to estimate human judgments in modern corpora (Section 3).
Next, we compare cnpmi to two baselines: inpmi and mta.As expected, cnpmi outperforms inpmi regardless of reference corpus overall, because inpmi only considers monolingual coherence.mta has higher correlations than cnpmi scores from the Bible, because the Bible fails to give accurate estimates due to limited topic coverage.mta, on the other hand, only depends on dictionaries, which are more comprehensive than the Bible.It is also possible that users are judging coherence based on translations across a topic pair, rather than the overall coherence, which would closely correlate with mta.
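For reference, the correlation analysis used in this comparison (average the five judgments per topic, then compute Pearson's correlation with a metric) can be sketched in Python as below. The ratings and metric scores are made-up placeholder values, not data from the study, and the function name `pearson` is our own.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Placeholder data: five ratings (0, 1, or 2) per topic and one metric score per topic.
ratings = [
    [2, 2, 1, 2, 2],
    [0, 1, 0, 0, 1],
    [1, 1, 2, 1, 0],
]
metric_scores = [0.30, 0.02, 0.15]

human_scores = [sum(r) / len(r) for r in ratings]
correlation = pearson(human_scores, metric_scores)
```

With real data, one such correlation would be computed per metric and per reference corpus, giving the cells of a table like Table 2.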

Re-Estimating Topic-Level Coherence
The Bible by itself produces cnpmi values that do not correlate well with human judgments (Table 2). After training an estimator (Section 4.2), we calculate Pearson's correlation between Wikipedia's cnpmi and the estimated topic coherence score (Table 3). A higher correlation with Wikipedia's cnpmi means more accurate coherence.
As a baseline, Bible-based cnpmi without adaptation has negative and near-zero correlations with Wikipedia's cnpmi; it does not capture coherence. After training the estimator, the correlations become stronger, indicating the estimated scores are closer to Wikipedia's cnpmi.

When mta Falls Short
We analyze mta from two aspects (the inability to capture semantically related non-translation topic words, and insensitivity to cardinality) to show why mta is not an ideal measurement, even though it correlates well with human judgments.
Semantics. We take two examples with en-zh (Topic 1) and en-tl (Topic 2) in Figure 5. Topic 1 has fewer translation pairs than Topic 2, which leads to a lower mta score for Topic 1. However, all words in Topic 1 talk about art, while it is hard to interpret Topic 2. Wikipedia cnpmi scores reveal that Topic 1 is more coherent. Because our experiments are on datasets with little divergence between the themes discussed across languages, this is uncommon for us but could appear in noisier datasets.
Cardinality. Increasing cardinality diminishes a topic's coherence (Lau and Baldwin, 2016). We vary the cardinality of topics from ten to fifty at intervals of ten (Figure 6). As cardinality increases, more low-probability and irrelevant words appear in the topic, which lowers cnpmi scores. However, mta stays stable or increases with increasing cardinality. Thus, mta fails to fulfill a critical property of topic model evaluation.
Finally, mta requires a comprehensive multilingual dictionary, which may be unavailable for low-resource languages. Additionally, most languages often only have one dictionary, which makes it problematic to use the same resource (a language's single multilingual dictionary) for training and evaluating models that use a dictionary to build multilingual topics (Hu et al., 2014). Given these concerns, we continue the paper's focus on cnpmi as a data-driven alternative to mta. However, for many applications mta may suffice as a simple, adequate evaluation metric.

Model-Level Evaluation
While the previous section looked at individual topics, we also care about how well cnpmi characterizes the quality of models through an average of a model's constituent topics.

Training Knowledge
Adding more knowledge to multilingual topic models improves topics (Hu et al., 2014), so an effective evaluation should reflect this improvement as knowledge is added to the model. For polylingual topic models, this knowledge takes the form of the number of linked documents.
We start by experimenting with no multilingual knowledge: no document pairs share a topic distribution θ d (but the documents are in the collection as unlinked documents). We then increase the number of document pairs that share θ d from 20% of the corpus to 100%. Fixing the topic cardinality at ten, cnpmi captures the improvements in models (Figure 7) through a higher coherence score.

Agreement with Machines
Topic models are often used as a feature extraction technique for downstream machine learning applications, and topic model evaluations should reflect whether these features are useful (Ramage et al., 2009). For each model, we apply a document classifier trained on the model parameters to test whether cnpmi is consistent with classification accuracy.
Specifically, we want our classifier to transfer information from training on one language to testing on another (Smet et al., 2011; Heyman et al., 2016). We train a classifier on one language's documents, where each document's feature vector is the document-topic distribution θ d. We apply this to TED Talks, where each document is labeled with multiple categories. We choose the most frequent seven categories across the corpus as labels, and only have labeled documents in one side of a bilingual topic model. cnpmi has very strong correlations with classification results, though using the Bible as the reference corpus gives slightly lower correlation, with higher variance, than Wikipedia (Figure 8).
Figure 6: Increasing cardinality of topic pairs makes it harder to judge the coherence. Decreasing cnpmi scores reflect the diminished interpretability of topics, while mta scores do not.
Table 4: At the model level, the estimator improves correlations between cnpmi and downstream classification for all languages except for Turkish.
Figure 8: Pearson correlation between classification F1 scores and cnpmi: both cnpmi data sources predict whether a classifier using topic features will work well, but Wikipedia has slightly higher correlation with lower variance.

Re-Estimating Model-Level Coherence
In Section 5.3, we improve Bible-based cnpmi scores for individual topics.Here, we show the estimator also improves model-level coherence.
We apply the estimator on the models created in Section 6.2 and calculate the correlation between estimated scores and Wikipedia's cnpmi (Table 4).
The coherence estimator substantially improves scores except for Turkish: the correlation is better before applying the estimator (0.911). We suspect that a lack of topic overlap between Turkish and languages other than Chinese is to blame (Figure 9); the features used by the estimator do not generalize well to other kinds of features; training on many language pairs would hopefully solve this issue. Turkish is also morphologically rich, and our preprocessing completely ignores morphology.

Reference Size
One challenge with low-resource languages is that even if Wikipedia is available, it may have too few documents to accurately calculate coherence. As a final analysis, we examine how the reliability of cnpmi degrades with a smaller reference corpus.
Figure 10: cnpmi is stable once the number of reference documents is large enough (around five thousand documents).
We randomly sample 20% to 100% of document pairs from the reference corpora and evaluate the polylingual topic model with all document links (Figure 10), again fixing the cardinality as 10.
cnpmi is stable across different amounts of reference documents, as long as the number of reference documents is sufficiently large.If there are too few reference documents (for example, 20% of Amharic Wikipedia is only 316 documents), then cnpmi degrades.
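The subsampling protocol can be sketched in Python as below. Only the sampling of reference document pairs is shown; a coherence function would then be re-run on each sample. The function name `subsample_reference`, the fixed random seed, and the ten placeholder document pairs are assumptions for illustration.

```python
import random

def subsample_reference(paired_docs, fraction, rng):
    """Randomly keep `fraction` of the reference document pairs (at least one)."""
    keep = max(1, round(fraction * len(paired_docs)))
    return rng.sample(paired_docs, keep)

# Placeholder reference corpus: ten (language-1, language-2) document pairs.
reference = [({f"en_word_{i}"}, {f"xx_word_{i}"}) for i in range(10)]

rng = random.Random(0)
fractions = [0.2, 0.4, 0.6, 0.8, 1.0]
samples = {f: subsample_reference(reference, f, rng) for f in fractions}
sample_sizes = {f: len(s) for f, s in samples.items()}
```

Document pairs are sampled as units, so each retained document keeps its linked counterpart in the other language; sampling the two sides independently would break the pairing that cnpmi relies on.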

Related Work
Topic Coherence. Many coherence metrics based on co-occurrence statistics have been proposed besides npmi. Similar metrics, such as asymmetrical word pair metrics (Mimno et al., 2011) and combinations of existing measurements (Lau et al., 2014; Röder et al., 2015), correlate well with human judgments. npmi is the current gold standard for evaluation and improvement of monolingual topic models (Pecina, 2010; Newman et al., 2011).
External Tasks. Another approach is to use a model for predictive tasks: the better the results are on external tasks, the better a topic model is assumed to be. A common task is held-out likelihood (Wallach et al., 2009b; Jagarlamudi and Daumé III, 2010; Fukumasu et al., 2012), but as Chang et al. (2009) show, this does not always reflect human interpretability. Other specific tasks have also been used, such as bilingual dictionary extraction (Liu et al., 2015; Ma and Nasukawa, 2017), cultural difference detection (Gutiérrez et al., 2016), and crosslingual document clustering (Vulić et al., 2015).
Representation Learning. Topic models are one example of a broad class of techniques for learning representations of documents (Bengio et al., 2013). Other approaches learn representations at the word (Klementiev et al., 2012; Vyas and Carpuat, 2016), paragraph (Mogadala and Rettinger, 2016), or corpus level (Søgaard et al., 2015). However, neural representation learning approaches are often data hungry and not adaptable to low-resource languages. The approaches here could help improve the evaluation of all multilingual representation learning algorithms (Schnabel et al., 2015).

Conclusion
We have provided a comprehensive analysis of topic model evaluation in multilingual settings, including for low-resource languages.While evaluation is an important area of topic model research, no previous work has studied evaluation of multilingual topic models.Our work provided two primary contributions to this area, including a new intrinsic evaluation metric, cnpmi, as well as a model for adapting this metric to low-resource languages without large reference corpora.
As this is the first study on evaluation for multilingual topic models, there is still room for improvement and further applications. For example, human judgment is more difficult to measure than in monolingual settings, and how to design a reliable and accurate survey for multilingual quality judgments remains an open question. We plan to extend cnpmi, as a measurement of multilingual coherence, to high-dimensional representations, e.g., multilingual word embeddings, particularly in low-resource languages (Ruder et al., 2017).
BBN Technologies, by DARPA award HR0011-15-C-0113. Boyd-Graber and Paul were supported by NSF grant IIS-1564275. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsors.

Figure 2: The coherence estimator takes multilingual topics and features extracted from them, then outputs an estimated topic coherence.

Figure 3: As the estimator adds additional features, the estimated topic coherence scores (solid lines) approach the Wikipedia cnpmi (dashed lines).

Figure 5: mta fails to capture semantically related words (Topic 1) and only looks at translation pairs regardless of internal coherence (Topic 2).

Figure 9: The overlap of topics and domain: only one out of nine Turkish and Chinese topics has domain overlap with Tagalog and Amharic topics. This hinders the Turkish estimator from capturing model-level properties.

Table 1: Number of document pairs in the training and reference datasets and number of dictionary entries for each language pair.

Table 2: Pearson correlations between human judgments and cnpmi are higher than inpmi, while mta correlations are comparable to cnpmi.

Figure 7: Adding more document links to the model produces more multilingually coherent topics. cnpmi captures this improvement.