Bilingual Segmented Topic Model

This study proposes the bilingual segmented topic model (BiSTM), which hierarchically models documents by treating each document as a set of segments, e.g., sections. While previous bilingual topic models, such as bilingual latent Dirichlet allocation (BiLDA) (Mimno et al., 2009; Ni et al., 2009), consider only cross-lingual alignments between entire documents, the proposed model considers cross-lingual alignments between segments in addition to document-level alignments and assigns the same topic distribution to aligned segments. This study also presents a method for simultaneously inferring latent topics and segmentation boundaries, incorporating unsupervised topic segmentation (Du et al., 2013) into BiSTM. Experimental results show that the proposed model significantly outperforms BiLDA in terms of perplexity and demonstrates improved performance in translation pair extraction (up to +0.083 extraction accuracy).


Introduction
Probabilistic topic models, such as probabilistic latent semantic analysis (PLSA) (Hofmann, 1999) and latent Dirichlet allocation (LDA), are generative models for documents that have been used as unsupervised frameworks to discover latent topics in document collections without prior knowledge. These topic models were originally applied to monolingual data; however, various recent studies have proposed the use of probabilistic topic models in multilingual settings, where latent topics are shared across multiple languages. These models have improved several multilingual tasks, such as translation pair extraction and cross-lingual text classification (see the survey paper by Vulić et al. (2015) for details). Most multilingual topic models, including bilingual LDA (BiLDA) (Mimno et al., 2009; Ni et al., 2009), model a document-aligned comparable corpus, such as a collection of Wikipedia articles, where aligned documents are topically similar but are not direct translations. In particular, these models assume that the documents in each tuple share the same topic distribution and that each cross-lingual topic has a language-specific word distribution.
Figure 1: Wikipedia Article Example
Existing multilingual topic models consider only document-level alignments. However, most documents are hierarchically structured, i.e., a document comprises segments (e.g., sections and paragraphs) that can be aligned across languages. Figure 1 shows a Wikipedia article example, which contains a set of sections. Sections 1, 2, and 3 in the English article correspond topically to sections 4, 2, and 3 in the Japanese counterpart, respectively. To date, such segment-level alignments have been ignored; however, we consider that such corresponding segments must share the same topic distribution. Du et al. (2010) have shown that segment-level topics and their dependencies can improve modeling accuracy in a monolingual setting. Based on that research, we expect that segment-level topics can also be useful for modeling multilingual data.
This study proposes a bilingual segmented topic model (BiSTM) that extends BiLDA to capture segment-level alignments through a hierarchical structure. In particular, BiSTM considers each document as a set of segments and models a document as a document-segment-word structure. The topic distribution of each segment (per-segment topic distribution) is generated using a Pitman-Yor process (PYP) (Pitman and Yor, 1997), in which the base measure is the topic distribution of the related document (per-document topic distribution). In addition, BiSTM introduces a binary variable that indicates whether two segments in different languages are aligned. If two segments are aligned, their per-segment topic distributions are shared; if they are not aligned, they are independently generated.
BiSTM leverages existing segments from a given segmentation. However, a segmentation is not always given, and a given segmentation might not be optimal for statistical modeling. Therefore, this study also presents a model, BiSTM+TS, that incorporates unsupervised topic segmentation into BiSTM. BiSTM+TS integrates point-wise boundary sampling into BiSTM in a manner similar to that proposed by Du et al. (2013) and infers segmentation boundaries and latent topics jointly.
Experiments using English-Japanese and English-French Wikipedia corpora show that the proposed models (BiSTM and BiSTM+TS) significantly outperform the standard bilingual topic model (BiLDA) in terms of perplexity, and that they improve performance in translation extraction (up to +0.083 top-1 accuracy). The experiments also reveal that BiSTM+TS is comparable to BiSTM, which uses manually provided segmentation, i.e., section boundaries in Wikipedia articles.

Bilingual LDA
This section describes the BiLDA model (Mimno et al., 2009; Ni et al., 2009), which we take as our baseline. BiLDA is a bilingual extension of basic monolingual LDA for a document-aligned comparable corpus. While monolingual LDA assumes that each document has its own topic distribution, BiLDA assumes that aligned documents share the same topic distribution and discovers latent cross-lingual topics.
Algorithm 1 and Figure 2 show the generative process and graphical model, respectively, of BiLDA. BiLDA models a document-aligned comparable corpus, i.e., a set of $D$ document pairs in two languages, $e$ and $f$. Each document pair $d_i$ ($i \in \{1, ..., D\}$) comprises aligned documents in languages $e$ and $f$: $d_i = (d_i^e, d_i^f)$. BiLDA assumes that each topic $k \in \{1, ..., K\}$ comprises a set of discrete distributions over words, one for each language. Each language-specific per-topic word distribution $\phi_k^l$ ($l \in \{e, f\}$) is drawn from a Dirichlet distribution with the prior $\beta^l$ (Steps 1-5). To generate a document pair $d_i$, the per-document topic distribution $\theta_i$ is first drawn from a Dirichlet distribution with the prior $\alpha$ (Step 7). Thus, aligned documents $d_i^e$ and $d_i^f$ share the same topic distribution. Then, for each word at position $m \in \{1, ..., N_i^l\}$ in document $d_i^l$ in language $l$, a latent topic assignment $z_{im}^l$ is drawn from a multinomial distribution with the parameter $\theta_i$ (Step 10). Finally, a word $w_{im}^l$ is drawn from a probability distribution $p(w_{im}^l \mid z_{im}^l, \phi^l)$ given the topic $z_{im}^l$ (Step 11).

Algorithm 1: Generative process of BiLDA
1: for each topic $k \in \{1, ..., K\}$ do
2:   for each language $l \in \{e, f\}$ do
3:     choose $\phi_k^l \sim \mathrm{Dirichlet}(\beta^l)$
4:   end for
5: end for
6: for each document pair $d_i$ do
7:   choose $\theta_i \sim \mathrm{Dirichlet}(\alpha)$
8:   for each language $l \in \{e, f\}$ do
9:     for each word $w_{im}^l$ ($m \in \{1, ..., N_i^l\}$) do
10:      choose $z_{im}^l \sim \mathrm{Multinomial}(\theta_i)$
11:      choose $w_{im}^l \sim p(w_{im}^l \mid z_{im}^l, \phi^l)$
12:    end for
13:  end for
14: end for
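The generative story of BiLDA can be illustrated with a small simulation. The following is a minimal sketch rather than the authors' implementation: the helper names, the toy vocabulary sizes, and the toy prior values (chosen larger than the experimental setting to keep the toy draws numerically stable) are all assumptions made for illustration. Words are represented by integer vocabulary indices.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution via normalized Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document_pair(phi, alpha, doc_lengths, rng):
    """Generate one aligned document pair.

    phi[lang][k] is the word distribution of topic k in language lang.
    Both documents draw their topics from the same theta (Step 7).
    """
    theta = sample_dirichlet(alpha, rng)
    pair = {}
    for lang, length in doc_lengths.items():
        words = []
        for _ in range(length):
            z = sample_index(theta, rng)         # topic assignment (Step 10)
            w = sample_index(phi[lang][z], rng)  # word index (Step 11)
            words.append((z, w))
        pair[lang] = words
    return theta, pair


rng = random.Random(0)
num_topics = 3                   # hypothetical toy value
vocab_sizes = {"e": 6, "f": 5}   # hypothetical toy vocabularies
beta = 0.5                       # hypothetical toy prior
phi = {
    lang: [sample_dirichlet([beta] * size, rng) for _ in range(num_topics)]
    for lang, size in vocab_sizes.items()
}
theta, pair = generate_document_pair(
    phi, [1.0] * num_topics, {"e": 8, "f": 10}, rng
)
```

The only coupling between the two languages is the shared `theta`; the word distributions in `phi` stay language-specific, which mirrors the assumption described above.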

Bilingual Segmented Topic Model
Here, we describe BiSTM, which extends BiLDA to capture segment-level alignments. Algorithm 2 and Figure 3 show the generative process and graphical model, respectively, of BiSTM. As can be seen in Figure 3, BiSTM introduces a segment-level layer between the document- and word-level layers in both languages. In other words, per-segment topic distributions for each language, $\nu^e$ and $\nu^f$, are introduced between per-document topic distributions $\theta$ and topic assignments for words, $z^e$ and $z^f$. In addition, BiSTM incorporates binary variables $y$ to represent segment-level alignments.
Each document $d_i^l$ in a pair of aligned documents $d_i$ is divided into $S_i^l$ segments: $d_i^l = \bigcup_{j=1}^{S_i^l} s_{ij}^l$. BiSTM makes the same assumption for per-topic word distributions as BiLDA, i.e., $\phi_k^l$ are language-specific and drawn from Dirichlet distributions (Steps 1-5).
In the generative process for a document pair $d_i$, the per-document topic distribution $\theta_i$ is first drawn in the same way as in BiLDA (Step 7). Thus, in BiSTM, each document pair shares the same topic distribution. Then, if segment-level alignments are not given, $y_i$ are generated (Steps 8-11). We assume that each document pair $d_i$ has a probability $\gamma_i$ that indicates comparability between segments across languages. $\gamma_i$ is drawn from a Beta distribution with the priors $\eta_0$ and $\eta_1$ (Step 9). Then, each element of $y_i$ is drawn from a Bernoulli distribution with the parameter $\gamma_i$ (Step 10). Here, $y_{ijj'} = 1$ if and only if $s_{ij}^e$ and $s_{ij'}^f$ are aligned; otherwise, $y_{ijj'} = 0$. Note that if segment-level alignments are observed, then Steps 8-11 are skipped. Next, a set of aligned segment sets $AS_i$ is generated based on $y_i$ (Step 12). For example, if $y_{i11}$ and $y_{i12}$ are 1 and the other $y$'s are 0, then Step 12 places $s_{i1}^e$, $s_{i1}^f$, and $s_{i2}^f$ in one aligned segment set, and each remaining segment forms a set of its own. Then, for each aligned segment set $AS_{ig}$ ($g \in \{1, ..., |AS_i|\}$), the per-segment topic distribution $\nu_{ig}$ is obtained from a Pitman-Yor process with the base measure $\theta_i$, the discount parameter $a$, and the concentration parameter $b$ (Step 14). Through Steps 12-15, aligned segments indicated by $y$ share the same per-segment topic distribution. For instance, $s_{i1}^e$, $s_{i1}^f$, and $s_{i2}^f$ have the same topic distribution $\nu_{i1} \sim \mathrm{PYP}(a, b, \theta_i)$ in the above example.
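The grouping performed in Step 12 can be sketched as finding connected components of the alignment graph. This is an illustrative reading, not the authors' code: the function name, the union-find implementation, and the choice of two English and three Japanese segments in the usage example are assumptions.

```python
def aligned_segment_sets(num_e, num_f, aligned_pairs):
    """Group segments into aligned segment sets.

    aligned_pairs lists the (j, j') index pairs with y_{jj'} = 1, using
    1-based segment indices. Segments connected through any chain of
    alignments end up in the same set (an assumption of this sketch);
    unaligned segments form singleton sets.
    """
    nodes = [("e", j) for j in range(1, num_e + 1)]
    nodes += [("f", j) for j in range(1, num_f + 1)]
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the representative, halving paths on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for j, j_prime in aligned_pairs:
        parent[find(("e", j))] = find(("f", j_prime))

    groups = {}
    for node in nodes:
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Toy document pair: 2 English and 3 Japanese segments (hypothetical sizes),
# with y_{11} = y_{12} = 1 and every other y equal to 0.
sets = aligned_segment_sets(2, 3, [(1, 1), (1, 2)])
```

Each returned group would then receive its own per-segment topic distribution drawn from the Pitman-Yor process with base measure $\theta_i$.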
Then, for each word at position $m \in \{1, ..., N_{ij}^l\}$ in segment $s_{ij}^l$ in document $d_i^l$ in language $l$, a latent topic assignment $z_{ijm}^l$ is drawn from a multinomial distribution with the parameter $\nu_{ig}$ (Step 20), where $g$ denotes the index of the element of $AS_i$ that includes the segment $s_{ij}^l$, e.g., $g$ for $s_{i2}^f$ is 1. Subsequently, a word $w_{ijm}^l$ is drawn based on the assigned topic $z_{ijm}^l$ and the language-specific per-topic word distribution $\phi^l$ in the same manner as in BiLDA (Step 21).

Table 1: Statistics used in our inference
$t_{igk}$: Table count of topic $k$ in the CRP for aligned segment set $g$ in document pair $i$.
$t_{ig\cdot}$: Total table count in aligned segment set $g$ in document pair $i$, i.e., $\sum_k t_{igk}$.
$n_{igk}$: Total number of words with topic $k$ in aligned segment set $g$ in document pair $i$.
$n_{ig\cdot}$: Total number of words in aligned segment set $g$ in document pair $i$, i.e., $\sum_k n_{igk}$.
Here, a language-dependent variable without a superscript denotes both the variable in language $e$ and that in $f$, e.g., $z = \{z^e, z^f\}$. Unfortunately, as in other probabilistic topic models, such as LDA and BiLDA, we cannot compute this posterior using an exact inference method. This section presents an approximation method for BiSTM based on blocked Gibbs sampling, inspired by Du et al. (2013). In our inference, the hierarchy in BiSTM, i.e., the generation of $\nu$ and $z$, is explained by the Chinese restaurant process (CRP), through which the parameters $\theta$, $\nu$, and $\phi$ are integrated out, and the statistics on table counts in the CRP, $t$, are introduced. Table 1 lists all statistics used in our inference, where $W^l$ denotes the vocabulary set in language $l$. Moreover, to accelerate convergence, we introduce an auxiliary binary variable $\delta_{ijm}^l$ for $w_{ijm}^l$, indicating whether $w_{ijm}^l$ is the first customer on a table ($\delta_{ijm}^l = 1$) or not ($\delta_{ijm}^l = 0$), and $t_{igk}$ is computed based on $\delta$ as follows: $t_{igk} = \sum_{s_{ij}^l \in AS_{ig}} \sum_{m=1}^{N_{ij}^l} \delta_{ijm}^l \, I(z_{ijm}^l = k)$, where $I(x)$ is a function that returns 1 if the condition $x$ is true and 0 otherwise. Our inference groups $z_{ijm}^l$ and $\delta_{ijm}^l$ (each group is called a "block") and jointly samples them.
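The bookkeeping behind the table and word counts can be sketched as follows. This is a hedged illustration that assumes $t_{igk}$ counts the words in aligned segment set $g$ that are assigned topic $k$ and are the first customer on their table; the function names and the flat `(g, k, delta)` input format are hypothetical conveniences for a single document pair.

```python
from collections import Counter


def crp_statistics(assignments):
    """Accumulate table counts t[(g, k)] and word counts n[(g, k)].

    assignments holds one (g, k, delta) triple per word: the index of the
    aligned segment set, the assigned topic, and the first-customer flag.
    """
    table_counts = Counter()
    word_counts = Counter()
    for g, k, delta in assignments:
        word_counts[(g, k)] += 1
        table_counts[(g, k)] += delta  # only first customers open a table
    return table_counts, word_counts


def totals_per_set(counts):
    """Sum counts over topics, giving the per-set totals."""
    totals = Counter()
    for (g, _topic), value in counts.items():
        totals[g] += value
    return totals


# Toy assignments for two aligned segment sets (hypothetical values).
assignments = [(0, 2, 1), (0, 2, 0), (0, 5, 1), (1, 2, 1), (1, 2, 1), (1, 2, 0)]
t, n = crp_statistics(assignments)
t_total = totals_per_set(t)
n_total = totals_per_set(n)
```

In a sampler these counters would be updated incrementally when a word's topic or first-customer flag is resampled, rather than recomputed from scratch.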
Moreover, if y is not observed, our inference alternates two different kinds of blocks, (z l ijm , δ l ijm ) and y ijj ′ . In each sampling, individual variables are resampled, conditioned on all other variables. In the following, we describe each sampling stage.
Sampling (z, δ): The joint posterior distribution of $z$, $w$, and $\delta$, $p(z, w, \delta \mid \alpha, \beta, a, b, y)$, is derived in a manner similar to that in Du et al. (2010). In this distribution, $\mathrm{Beta}_K(\cdot)$ and $\mathrm{Beta}_{W^l}(\cdot)$ are $K$- and $|W^l|$-dimensional beta functions, respectively, $(b|a)_n$ is the Pochhammer symbol, and $(b)_n$ is given by $(b|1)_n$. $S(n, m, a)$ is a generalized Stirling number of the second kind (Hsu and Shiue, 1998), which is given by the linear recursion $S(n + 1, m, a) = S(n, m - 1, a) + (n - ma) S(n, m, a)$.
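The recursion for $S(n, m, a)$ can be tabulated directly in log space. The sketch below is one possible way to do this and is not taken from the authors' code: it assumes the usual boundary conditions $S(0, 0, a) = 1$ and $S(n, 0, a) = 0$ for $n > 0$, and it assumes the Pochhammer symbol with increment, $(x|y)_n = x (x + y) \cdots (x + (n - 1) y)$. All function names are hypothetical.

```python
import math

NEG_INF = float("-inf")


def log_add(x, y):
    """Return log(exp(x) + exp(y)) without leaving log space."""
    if x == NEG_INF:
        return y
    if y == NEG_INF:
        return x
    hi, lo = max(x, y), min(x, y)
    return hi + math.log1p(math.exp(lo - hi))


def log_stirling_table(max_n, a):
    """Tabulate log S(n, m, a) for 0 <= m <= n <= max_n.

    Uses S(n+1, m, a) = S(n, m-1, a) + (n - m*a) * S(n, m, a), with
    S(0, 0, a) = 1 and S(n, 0, a) = 0 for n > 0 (assumed boundaries).
    Entries with m > n stay at -inf, i.e., S = 0.
    """
    table = [[NEG_INF] * (max_n + 1) for _ in range(max_n + 1)]
    table[0][0] = 0.0
    for n in range(max_n):
        for m in range(1, n + 2):
            term = table[n][m - 1]
            if m <= n:
                # n - m*a is positive here because 0 <= a < 1 and m <= n.
                term = log_add(term, math.log(n - m * a) + table[n][m])
            table[n + 1][m] = term
    return table


def log_pochhammer(x, y, n):
    """Log of (x|y)_n = x (x + y) ... (x + (n - 1) y); equals 0.0 for n = 0."""
    return sum(math.log(x + j * y) for j in range(n))


# Example with the discount a = 0.2 and concentration b = 10.
stirling = log_stirling_table(3, 0.2)
log_ratio_term = log_pochhammer(10.0, 0.2, 3)
```

Filling the table once costs quadratic time in `max_n`, after which every lookup during sampling is a constant-time read.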
To reduce computational cost, the Stirling numbers are precomputed in logarithmic form (Buntine and Hutter, 2012), and the cached values are used in our sampling. The joint conditional distributions of $z_{ijm}^l$ and $\delta_{ijm}^l$ are obtained from the above joint distribution using Bayes' rule.
Sampling y: In our inference, each aligned segment set corresponds to a restaurant in the CRP. We regard the sampling of $y_{ijj'}$ as the choice of splitting or merging restaurant(s) in a manner similar to that in the sampling of segmentation boundaries in Du et al. (2013). In particular, if $y_{ijj'} = 0$, then one aligned segment set $AS_m$ is split into two aligned segment sets $AS_l$ and $AS_r$, where $AS_l$, $AS_r$, and $AS_m$ include $s_{ij}^e$, $s_{ij'}^f$, and both, respectively. If $y_{ijj'} = 1$, then $AS_l$ and $AS_r$ are merged into $AS_m$. For simplicity, our inference specifies $AS_l$ and $AS_r$ based on the current $y$ in terms of $AS_i(j)$, the element of $AS_i$ that includes the segment $j$, and $AS_i^l(j)$, the set of segments in language $l$ included in $AS_i(j)$. The conditional distributions of $y_{ijj'}$, e.g., $p(y_{ijj'} = 0 \mid y_{-y_{ijj'}}, z, w, \delta, \alpha, a, b, \eta_0, \eta_1)$, use the following quantities: $T$ is the set of $t_{igk}$ such that for either or both of $AS_l$ and $AS_r$, $t_{igk} = 1$; $c_{i0}$ and $c_{i1}$ are the total number of $y_i$'s whose values are 0 and that of $y_i$'s whose values are 1, respectively. Note that, in addition to $y_{ijj'}$, we change the $y_i$'s that relate to the selected action (merging or splitting) to maintain consistency between $y$ and the aligned segment sets.

Integration of Topic Segmentation into BiSTM (BiSTM+TS)
To infer segmentation boundaries simultaneously with cross-lingual topics, we integrate the unsupervised Bayesian topic segmentation method proposed by Du et al. (2013) into the proposed BiSTM (BiSTM+TS).
We assume that each segment is a sequence of topically-related passages. In particular, we consider a sentence as a passage. Our segmentation model defines a segment in document $d_i^l$ by a boundary indicator variable $\rho_{ih}^l$ for each passage $u_{ih}^l$ ($h \in \{1, ..., U_i^l\}$); $\rho_{ih}^l$ is 1 if there is a boundary after passage $u_{ih}^l$ (otherwise 0). For example, $\rho_i^l = (0, 1, 0, 0, 1)$ indicates that the document $d_i^l$ comprises the two segments $\{u_{i1}^l, u_{i2}^l\}$ and $\{u_{i3}^l, u_{i4}^l, u_{i5}^l\}$. Algorithm 3 shows the generative process for segments. The generative process of BiSTM+TS inserts Algorithm 3 between Steps 7 and 8 of Algorithm 2. Note that two documents $(d_i^e, d_i^f) \in d_i$ are segmented independently. BiSTM+TS assumes that each document $d_i^l$ has its own topic shift probability $\pi_i^l$. For each document $d_i^l$, $\pi_i^l$ is first drawn from a Beta distribution with the priors $\lambda_0$ and $\lambda_1$ (Step 2). Then, for each passage $u_{ih}^l$ ($h \in \{1, ..., U_i^l\}$), $\rho_{ih}^l$ is drawn from a Bernoulli distribution with the parameter $\pi_i^l$ (Step 4). Finally, segments $s_i^l$ are generated by concatenating passages based on $\rho_i^l$ (Step 6).
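The concatenation step that turns boundary indicators into segments can be sketched in a few lines. The function name and the string labels for passages are hypothetical; the handling of trailing passages when the last indicator is 0 is an assumption of this sketch.

```python
def segments_from_boundaries(passages, rho):
    """Concatenate passages into segments.

    rho[h] == 1 means there is a boundary after passages[h]. Passages that
    follow the last boundary are collected into a final segment (assumed).
    """
    if len(passages) != len(rho):
        raise ValueError("one boundary indicator is needed per passage")
    segments, current = [], []
    for passage, boundary in zip(passages, rho):
        current.append(passage)
        if boundary == 1:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


# The example from the text: rho = (0, 1, 0, 0, 1) over five passages.
example_segments = segments_from_boundaries(
    ["u1", "u2", "u3", "u4", "u5"], (0, 1, 0, 0, 1)
)
```

Flipping a single entry of `rho` from 0 to 1 splits one segment into two, and flipping it back merges them, which is the move that point-wise boundary sampling considers.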

Inference for BiSTM+TS
Our inference for BiSTM+TS alternates three different kinds of blocks: the sampling of $\rho$ and the two samplings for BiSTM ($(z, \delta)$ and $y$). The conditional distribution of $\rho$ comprises the Gibbs probability for splitting one segment $s_m$ into two segments $s_r$ and $s_l$ by placing the boundary after $u_{ih}^l$ ($\rho_{ih}^l = 1$) and that for merging $s_r$ and $s_l$ into $s_m$ by removing the boundary after $u_{ih}^l$ ($\rho_{ih}^l = 0$). These probabilities are estimated in the same manner as the conditional probabilities of $y_{ijj'}$, where $y$ ($y_{ijj'} = 0, 1$), $AS_l$, $AS_r$, $AS_m$, $\eta_0$, and $\eta_1$ are replaced with $\rho$ ($\rho_{ih}^l = 1, 0$), $s_l$, $s_r$, $s_m$, $\lambda_1$, and $\lambda_0$, respectively, and the statistics $t$ and $n$ are summed for every segment rather than for every aligned segment set (see Equations (6) and (9) in Du et al. (2013)).
Our inference assumes that sampling $\rho$ does not depend on aligned segments in the other language, i.e., $y$. After splitting or merging, we set the $y$'s of $s_m$, $s_l$, and $s_r$ as follows: if $s_m$ is split into $s_l$ and $s_r$, then $AS(s_l) = AS(s_m)$ and $AS(s_r) = AS(s_m)$; if $s_l$ and $s_r$ are merged into $s_m$, then $AS(s_m) = AS(s_l) \cup AS(s_r)$.

Experiment
We evaluated the proposed models in terms of perplexity and performance in translation pair extraction, which is a well-known application that uses a bilingual topic model. We used a document-aligned comparable corpus comprising 3,995 document pairs, each of which is a Japanese Wikipedia article in the Kyoto Wiki Corpus and its corresponding English Wikipedia article. Note that the English articles were collected from the English Wikipedia database dump (2 June 2015) based on inter-language links, even though the original Kyoto Wiki Corpus is a parallel corpus, in which each sentence in the Japanese articles is manually translated into English. Thus, our experimental data is not a parallel corpus. We extracted texts from the collected English articles using an open-source script. All Japanese and English texts were segmented using MeCab and TreeTagger (Schmid, 1994), respectively. Then, function words were removed, and the remaining words were lemmatized to reduce data sparsity.
For translation extraction experiments, we automatically created a gold-standard translation set according to Liu et al. (2013). We first computed $p(w^e \mid w^f)$ and $p(w^f \mid w^e)$ by running IBM Model 4 on the original Kyoto Wiki Corpus, which is a parallel corpus, using GIZA++ (Och and Ney, 2003), and then extracted word pairs $(\hat{w}^e, \hat{w}^f)$ that satisfy both of the following conditions: $\hat{w}^e = \operatorname{argmax}_{w^e} p(w^e \mid w^f = \hat{w}^f)$ and $\hat{w}^f = \operatorname{argmax}_{w^f} p(w^f \mid w^e = \hat{w}^e)$. Finally, we eliminated word pairs that do not appear in the document pairs in the document-aligned comparable corpus. We used all 7,930 Japanese words in the resulting gold-standard set as the evaluation input.
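The two argmax conditions amount to keeping word pairs that are each other's best translation in both directions. The following is a minimal sketch under the assumption that the lexical translation tables are available as nested dictionaries; the function name, the data layout, and the toy probabilities are hypothetical and are not GIZA++ output.

```python
def mutual_best_pairs(p_e_given_f, p_f_given_e):
    """Extract word pairs that are mutual best translations.

    p_e_given_f[wf][we] holds p(we | wf); p_f_given_e[we][wf] holds p(wf | we).
    A pair (we, wf) is kept only if we maximizes p(. | wf) and
    wf maximizes p(. | we).
    """
    pairs = []
    for wf, forward in p_e_given_f.items():
        best_e = max(forward, key=forward.get)
        backward = p_f_given_e.get(best_e)
        if backward and max(backward, key=backward.get) == wf:
            pairs.append((best_e, wf))
    return pairs


# Toy tables with romanized Japanese words (hypothetical values).
p_e_given_f = {
    "tera": {"temple": 0.7, "shrine": 0.3},
    "jinja": {"shrine": 0.6, "temple": 0.4},
    "miya": {"shrine": 0.9, "palace": 0.1},
}
p_f_given_e = {
    "temple": {"tera": 0.8, "jinja": 0.2},
    "shrine": {"jinja": 0.5, "miya": 0.3, "tera": 0.2},
    "palace": {"miya": 1.0},
}
gold_pairs = mutual_best_pairs(p_e_given_f, p_f_given_e)
```

In this toy example "miya" is dropped because its best English translation, "shrine", prefers "jinja" in the reverse direction.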

Competing Methods
We compared the proposed models (BiSTM and BiSTM+TS) with a standard bilingual topic model (BiLDA). BiSTM considers each section in Wikipedia articles as a segment. Note that alignments between sections are not given in our experimental data. Thus, y is inferred in both BiSTM and BiSTM+TS.
As in the proposed models, BiLDA was trained using Gibbs sampling (Mimno et al., 2009; Ni et al., 2009; Vulić et al., 2015). In the training of each model, each variable was first initialized. Here, $z_{ijm}^l$ is randomly initialized to an integer between 1 and $K$, and each of $\delta_{ijm}^l$, $y_{ijj'}$, and $\rho_{ih}^l$ is randomly initialized to 0 or 1. We then performed 10,000 Gibbs iterations. We used the symmetric priors $\alpha_k = 50/K$ and $\beta_w^l = 0.01$ over $\theta$ and $\phi^l$, respectively, in accordance with Vulić et al. (2011). The hyperparameters $a$, $b$, $\lambda_0$, and $\lambda_1$ were set to 0.2, 10, 0.1, and 0.1, respectively, in accordance with Du et al. (2010). Both $\eta_0$ and $\eta_1$ were set to 0.2 as a result of preliminary experiments. We used several values of $K$ to measure the impact of topic size: we used $K = 100$ and $K = 400$ in accordance with Liu et al. (2013) in addition to the suggested value $K = 2{,}000$ in Vulić et al. (2011).
Table 2: Test Set Perplexity
In the translation extraction experiments, we used two translation extraction methods, i.e., Cue (Vulić et al., 2011) and Liu (Liu et al., 2013). Both methods first infer cross-lingual topics for words using a bilingual topic model (BiLDA/BiSTM/BiSTM+TS) and then extract word pairs $(w^e, w^f)$ with a high value of the probability $p(w^e \mid w^f)$ defined by the inferred topics. Cue calculates $p(w^e \mid w^f)$ from the per-topic word distributions, where $p(w \mid k) = \phi_{kw}$. Liu first converts a document-aligned comparable corpus into a topic-aligned parallel corpus according to the topics of words and computes $p(w^e \mid w^f, k)$ by running IBM Model 1 on the parallel corpus. Liu then calculates $p(w^e \mid w^f) = \sum_{k=1}^{K} p(w^e \mid w^f, k) \, p(k \mid w^f)$. Hereafter, a bilingual topic model used in an extraction method is shown in parentheses, e.g., Cue(BiLDA) denotes Cue with BiLDA.
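The final marginalization over topics in the Liu method can be sketched as a weighted sum of topic-specific translation tables. This illustrates only the summation over $k$, not the IBM Model 1 training that produces $p(w^e \mid w^f, k)$; the function name, the data layout, and the toy probabilities are assumptions.

```python
def translation_distribution(p_e_given_f_and_topic, p_topic_given_f):
    """Compute p(we | wf) = sum_k p(we | wf, k) * p(k | wf) for one source word.

    p_e_given_f_and_topic[k] maps each candidate we to p(we | wf, k);
    p_topic_given_f[k] is p(k | wf).
    """
    scores = {}
    for k, topic_weight in enumerate(p_topic_given_f):
        for we, prob in p_e_given_f_and_topic[k].items():
            scores[we] = scores.get(we, 0.0) + prob * topic_weight
    return scores


# Toy example with two topics for a single source word (hypothetical values).
p_topic_given_f = [0.75, 0.25]
p_e_given_f_and_topic = [
    {"temple": 0.8, "shrine": 0.2},
    {"temple": 0.4, "shrine": 0.6},
]
scores = translation_distribution(p_e_given_f_and_topic, p_topic_given_f)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Sorting candidates by the resulting score gives the ranked translation list from which the top candidates are taken for evaluation.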

Experimental Results
We evaluated the predictive performance of each model by computing the test set perplexity based on 5-fold cross-validation. A lower perplexity indicates better generalization performance. Table 2 shows the perplexity of each model. As can be seen, BiSTM and BiSTM+TS are better than BiLDA in terms of perplexity.
We measured the performance of translation extraction with top-$N$ accuracy ($\mathrm{ACC}_N$), i.e., the number of test words whose top $N$ translation candidates contain a correct translation divided by the total number of test words (7,930). Table 3 summarizes $\mathrm{ACC}_1$ and $\mathrm{ACC}_{10}$ for each model. As can be seen, Cue/Liu(BiSTM) and Cue/Liu(BiSTM+TS) significantly outperform Cue/Liu(BiLDA) ($p < 0.01$ in the sign test). This indicates that BiSTM and BiSTM+TS improve the performance of translation extraction for both the Cue and Liu methods by assigning more suitable topics.
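The top-$N$ accuracy described above can be computed as in the sketch below. The function name and the dictionary-based data layout are hypothetical; test words without any candidate list are counted as misses, which is an assumption of this sketch.

```python
def top_n_accuracy(candidates, gold, n):
    """Fraction of test words whose top-n candidates contain a correct translation.

    candidates[w] is a ranked list of translation candidates for test word w;
    gold[w] is the set of correct translations of w.
    """
    hits = 0
    for word, references in gold.items():
        top = candidates.get(word, [])[:n]
        if references.intersection(top):
            hits += 1
    return hits / len(gold)


# Toy evaluation with three test words (hypothetical values).
gold = {"tera": {"temple"}, "jinja": {"shrine"}, "miya": {"palace", "shrine"}}
candidates = {
    "tera": ["temple", "shrine"],
    "jinja": ["temple", "shrine"],
    "miya": ["castle", "garden"],
}
acc1 = top_n_accuracy(candidates, gold, 1)
acc2 = top_n_accuracy(candidates, gold, 2)
```

Because a hit at rank 1 is also a hit at any larger $N$, the accuracy never decreases as $N$ grows.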
Both experiments indicate that capturing segment-level alignments is effective for modeling bilingual data. In addition, these experiments show that BiSTM+TS is comparable with BiSTM.
Tables 2 and 3 show that a larger topic size yields better performance for each model. Furthermore, Liu outperforms Cue regardless of the choice of bilingual topic models, which is consistent with previously reported results (Liu et al., 2013). The results of our experiments demonstrate that the proposed models have the same tendencies as BiLDA.

Inferred Segment-level Alignments
We created a reference set to evaluate segment-level alignments $y$ inferred by BiSTM ($K = 2{,}000$). We randomly selected 100 document pairs from the comparable corpus and then manually identified cross-lingual alignments between sections. Table 4 shows the distribution of inferred $y$ values and that of $y$ values in the reference set. As can be seen, the accuracy of $y$ is 0.859 (1,327/1,544).
The majority of false negatives (121/174) are sections that are not parallel but correspond partially. An example is the alignment between the Japanese section "history" and the English section "Bujutsu (old type of Budo)" in the "Budo (a Japanese martial art)" article pair, where a part of the English section "Bujutsu" is described in the Japanese section "history." Such errors might not necessarily have a negative effect, because partial alignments can be useful.

Table 5: Average number of segments per article
Model       Japanese article   English article
BiSTM       4.8                2.9
BiSTM+TS    10.6               4.1

Inferred Segmentation Boundaries
This section compares segment boundaries inferred by BiSTM+TS ($K = 2{,}000$) with section boundaries in the original articles, which BiSTM uses as its segmentation. The recall of BiSTM+TS for the original section boundaries is 0.727. This indicates that the unsupervised segmentation in BiSTM+TS finds drastic topical changes, i.e., section boundaries, with high recall. Table 5 shows the average number of segments per article for each model. As can be seen, BiSTM+TS divides an article into segments smaller than the original sections. This seems to be reasonable, because some original sections include multiple topics. However, Tables 2 and 3 show that inferred boundaries do not work better than section boundaries. One reason is data sparseness, which arises when BiSTM+TS separates an article into extremely fine-grained segments. In addition, Table 5 reveals that BiSTM+TS increases the gap between languages in the number of segments. Thus, segmentation with a comparable granularity between languages might be favorable for the proposed models.

Effectiveness for an English-French Wikipedia Corpus
We evaluated BiLDA, BiSTM, and BiSTM+TS in terms of perplexity and performance in translation extraction on an English-French Wikipedia corpus to verify the effectiveness of the proposed models for language pairs other than English-Japanese. The settings, e.g., parameters, for each model are the same as in Section 5. Note that we report only the performance of each model with $K = 2{,}000$, because all models achieved their best performance when $K = 2{,}000$. We collected French articles that correspond to the English articles used in the experiments in Section 5 from the French Wikipedia database dump (2 June 2015) based on inter-language links. As a result, our English-French corpus comprises 3,159 document pairs. The French articles were preprocessed in the same manner as the English articles: text extraction using the open-source script, segmentation using TreeTagger, removal of function words, and lemmatization.
We created a gold-standard translation set for translation extraction experiments using the Google Translate service in a manner similar to that in Gouws et al. (2015) and Coulmance et al. (2015): we translated the French words in our corpus using Google Translate and then eliminated word pairs that do not appear in the document pairs in our corpus. We used the 1,000 most frequent French words in the resulting gold-standard set as the evaluation input. Table 6 summarizes $\mathrm{ACC}_1$, $\mathrm{ACC}_{10}$, and perplexity. It shows that the proposed models are also effective for the English-French Wikipedia corpus. BiSTM and BiSTM+TS outperform BiLDA in terms of perplexity and performance of translation extraction, and BiSTM+TS works well even if the boundaries of segments are unknown.

Related Work
Multilingual topic models other than BiLDA (Section 2) have been proposed for document-aligned comparable corpora. Fukumasu et al. (2012) applied SwitchLDA (Newman et al., 2006) and Correspondence LDA, which were originally intended to work with multimodal data, such as annotated image data, to modeling multilingual text data. They also proposed a symmetric version of Correspondence LDA. Platt et al. (2010) projected monolingual models based on PLSA or Principal Component Analysis into a shared multilingual space with the constraint that document pairs must map to similar locations. Hu et al. (2014) proposed a multilingual tree-based topic model that uses a hierarchical bilingual dictionary in addition to document alignments. Note that these models do not consider segment-level alignments.
There are several multilingual topic models tailored for data other than a document-aligned comparable corpus, including bilingual topic models for word alignment and machine translation on parallel sentence pairs (Zhao and Xing, 2006;Zhao and Xing, 2008). Some models have mined multilingual topics from unaligned text data by bridging the gap between different languages using a bilingual dictionary (Jagarlamudi and Daumé III, 2010;Zhang et al., 2010;Negi, 2011). Boyd-Graber and Blei (2009) used parallel sentences in combination with a bilingual dictionary. However, these models have the drawback that they require a parallel corpus or a bilingual dictionary in advance, which cannot be obtained for some language pairs or domains.
In a monolingual setting, some topic models that consider segment-level topics have been proposed. Du et al. (2010) considered a document as a set of segments and generated each per-segment topic distribution from the topic distribution of the related document through a Pitman-Yor process. Others have considered a document as a sequence of segments. Cheng et al. (2009) reflected the underlying sequences of segments' topics by positing a permutation distribution over a document. Wang et al. (2011) modeled topical sequences in documents with a latent first-order Markov chain, and Du et al. (2012) generated each per-segment topic distribution from the topic distribution of its document and that of its previous segment. Note that none of these models have been extended to a multilingual setting.

Conclusions
In this paper, we proposed BiSTM, which models a document hierarchically and deals with segment-level alignments. BiSTM assigns the same topic distribution to both aligned documents and aligned segments. We also presented an extended model, BiSTM+TS, that infers segmentation boundaries in addition to latent topics by incorporating unsupervised topic segmentation (Du et al., 2013). Our experimental results show that capturing segment-level alignments improves perplexity and translation extraction performance, and that BiSTM+TS yields a significant benefit even if the boundaries of segments are not given.
This paper presented an extension to BiLDA, but hierarchical structures can also be incorporated into other bilingual topic models (Section 7). As future work, we would like to verify the effectiveness of the proposed models for other datasets or other cross-lingual tasks, such as cross-lingual document classification (Ni et al., 2009; Platt et al., 2010; Ni et al., 2011) and cross-lingual information retrieval (Vulić et al., 2013).