Detecting Common Discussion Topics Across Culture From News Reader Comments

News reader comments found on many online news websites are typically massive in volume. We investigate the task of Cultural-common Topic Detection (CTD), which is aimed at discovering common discussion topics from news reader comments written in different languages. We propose a new probabilistic graphical model called MCTA which can cope with the language gap and capture the common semantics in different languages. We also develop a partially collapsed Gibbs sampler which effectively incorporates the term translation relationship into the detection of cultural-common topics for model parameter learning. Experimental results show improvements over the state-of-the-art model.


Introduction
Nowadays the rapid development of information and communication technology enables more and more people around the world to engage in the movement of globalization. One effect of globalization is to facilitate greater connections between people bringing cultures closer than before. This also contributes to the convergence of some elements of different cultures (Melluish, 2014). For example, there is a growing tendency of people watching the same movie, listening to the same music, and reading the news about the same event. This kind of cultural homogenization brings the emergence of commonality of some aspects of different cultures worldwide. It would be beneficial to identify such common aspects among cultures. For example, it can provide some insights for global market and international business (Cavusgil et al., 2014).
Many news websites from different regions of the world report significant events which are of interest to people on different continents. These websites also allow readers around the world to comment in their own languages. The volume of comments is often enormous, especially for popular events. On a news website, readers from a particular cultural background tend to write comments in their preferred language. For some important or global events, we observe that readers from different cultures express common discussion topics in their different languages. For instance, on March 8, 2014, Malaysia Airlines Flight MH370, carrying 227 passengers and 12 crew members, disappeared. Many news articles around the world reported this event, and many readers from different continents commented on it. By manually analyzing the reader comments, we observe that both English-speaking and Chinese-speaking readers expressed, in their respective languages, their wish to pray for the MH370 flight. This is an example of a cultural-common discussion topic. Identifying such cultural-common topics automatically can facilitate better understanding and organization of the common concerns or interests of readers with different language backgrounds. Such technology can be deployed in various applications. One application is a reader comment digest system that organizes comments by cultural-common discussion topics and ranks the topics by popularity. This makes it possible to analyze the common focus of readers from different cultures on a particular event. An example of such an application is shown in Figure 3. Under each event, reader comments are grouped by cultural-common topics.
In this paper, we investigate the task of Cultural-common Topic Detection (CTD) on multilingual news reader comments. Reader comments about a global event, written in different languages on different news websites around the world, exist in massive amounts. The main goal of this task is to discover cultural-common discussion topics from raw multilingual news reader comments for a news event. One challenge is that the discussion topics are unknown. Another challenge is the language gap: reader comments in different languages are composed of terms from different vocabularies. This language gap makes it considerably harder to identify cultural-common discussion topics in multilingual news comment settings.
One recent work by Prasojo et al. (2015) organizes news reader comments around the entities and aspects discussed by readers. Such an organization of reader comments cannot identify common discussion topics. On the other hand, the Muto model proposed by Boyd-Graber and Blei (2009) can extract common topics from multilingual documents. This model merely outputs cross-lingual topics of matching word pairs. One example of such a topic contains key terms of word pairs such as "plane:飞机 ocean:海洋 . . . ". The assumption of a one-to-one mapping of words has some drawbacks. One drawback is that the correspondence of identified common topics is restricted to the vocabulary level. Another drawback is that the one-to-one mapping of words cannot fit the original word occurrences well. For example, the English term "plane" appears frequently in the English documents while the Chinese translation "飞机" appears less often. It is not reasonable that "plane" and "飞机" share the same probability mass in common topics. Another closely related existing work is the PCLSA model. PCLSA employs a mixture of English words and Chinese words to represent common topics. It incorporates bilingual constraints into the Probabilistic Latent Semantic Analysis (PLSA) model (Hofmann, 2001) and assumes that word pairs in the dictionary share similar probability in a common topic. However, similar to the one-to-one mapping of words, such bilingual constraints cannot handle the original word co-occurrence in each language well, which degrades the coherence and interpretability of common topics.
We propose a new probabilistic graphical model which is able to detect cultural-common topics from multilingual news reader comments in an unsupervised manner. In principle, no labeled data is needed. In this paper, we focus on dealing with two languages, namely, English and Chinese news reader comments. Different from prior works, we design a technique based on auxiliary distributions which incorporates word distributions from the other language and can capture the common semantics on the topic level. We develop a partially collapsed Gibbs sampler which decouples the inference of topic distribution and word distribution. We also incorporate the term translation relationship, derived from a bilingual dictionary, into the detection of cultural-common topics for model parameter learning.
We have prepared a data set by collecting English and Chinese reader comments from different regions reflecting different cultures. Our experimental results are encouraging, showing improvements over the state-of-the-art model.

Related Work
Prasojo et al. (2015) and Biyani et al. (2015) organized news reader comments via identified entities or aspects. Such an organization via entities or aspects cannot capture common topics discussed by readers. Digesting based merely on entities fails to work in multilingual settings because the common entities have distinct mentions in different languages. Zhai et al. (2004) discovered common topics from comparable texts via a PLSA-based mixture model. Paul and Girju (2009) proposed a Mixed-Collection Topic Model for finding common topics from different collections. Although the above models can find a kind of common topic, they only deal with a single-language setting without considering the language gap. Some works discover common latent topics from multilingual corpora. For aligned corpora, they assume that the topic distribution in each document is the same (Vulić et al., 2011; Vulić and Moens, 2014; Erosheva et al., 2004; Fukumasu et al., 2012; Mimno et al., 2009; Ni et al., 2009; Zhang et al., 2013; Peng et al., 2014). However, aligned corpora are unavailable for most domains. For unaligned corpora, cross-lingual topic models use some language resources, such as a bilingual dictionary or a bilingual knowledge base, to bridge the language gap (Boyd-Graber and Blei, 2009; Jagarlamudi and Daumé III, 2010). As mentioned above, Boyd-Graber and Blei (2009) as well as Jagarlamudi and Daumé III (2010) focus on mining the correspondence of topics at the vocabulary level, which differs from the goal of PCLSA and from ours. PCLSA adds the constraints of word translation pairs into PLSA. These constraints cannot handle the original word co-occurrences well.
In contrast, we consider the language gap by incorporating word distributions from the other language, capturing the common semantics on the topic level. Moreover, we use a fully Bayesian paradigm with a prior distribution.
Some existing topic methods conduct cross-lingual sentiment analysis (Lu et al., 2011; Guo et al., 2010; Lin et al., 2014; Boyd-Graber and Resnik, 2010). These models are not suitable for our CTD task because they mainly detect common elements related to product aspects. Moreover, some of these works focus more on detecting sentiments.

Model Description
The CTD task is defined as follows. For a particular event, both English and Chinese news reader comments are collected from different regions reflecting different cultures. The set of English comments is denoted by E and the set of Chinese comments is denoted by C. The goal of the CTD task is to extract cultural-common topics k ∈ {1, 2, . . . , K} from E and C. The multilingual news reader comments are processed separately for each event.
Our proposed model is called Multilingual Cultural-common Topic Analysis (MCTA), which is based on the graphical model paradigm as depicted in Figure 1. The plate on the right represents cultural-common topics. Each cultural-common topic $k$ is represented by an English word distribution $\phi^e_k$ over the English vocabulary $\Lambda^e$ and a Chinese word distribution $\phi^c_k$ over the Chinese vocabulary $\Lambda^c$. We make use of a bilingual dictionary, which is composed of many-to-many word translations among English and Chinese words. To capture common semantics of multilingual news reader comments, we design two auxiliary distributions $\eta^e_k$, with dimension $|\Lambda^e|$, and $\eta^c_k$, with dimension $|\Lambda^c|$, to help the generation of $\phi^e_k$ and $\phi^c_k$. Precisely, we generate $\eta^e_k$ and $\eta^c_k$ from the Dirichlet prior distributions $\mathrm{Dir}(\beta \cdot \mathbf{1}_{|\Lambda^e|})$ and $\mathrm{Dir}(\beta \cdot \mathbf{1}_{|\Lambda^c|})$ respectively, where $\mathbf{1}_D$ denotes a $D$-dimensional vector whose components are 1. Then we draw $\phi^e_k$ from the mixture of $\eta^e_k$ and the translation of $\eta^c_k$. It is formulated as:

$$\phi^e_k = (1 - \lambda)\,\eta^e_k + \lambda\,(M^{c \to e})^{\top} \eta^c_k \qquad (1)$$

where $\lambda \in (0, 1)$ is a parameter which balances the nature of original topics and transferred information from the other language. $M^{c \to e}$ is a $|\Lambda^c| \times |\Lambda^e|$ mapping matrix from $\Lambda^c$ to $\Lambda^e$. Each element $M^{c \to e}_{ij}$ is the mapping occurrence probability of the English term $w^e_j$ given the Chinese term $w^c_i$ in the set of news reader comments. This probability is calculated as:

$$M^{c \to e}_{ij} = \begin{cases} \dfrac{C(w^e_j) + 1}{\sum_{w^e \in T(w^c_i)} \left( C(w^e) + 1 \right)} & \text{if } w^e_j \in T(w^c_i) \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where $C(w^e_j)$ is the count of $w^e_j$ in all news reader comments and $T(w^c_i)$ is the set of English translations of $w^c_i$ found in the bilingual dictionary. The "add-one" smoothing is adopted. Note that the sum of each row is equal to 1. Using the same principle, we can derive $\phi^c_k$, which can be formulated as:

$$\phi^c_k = (1 - \lambda)\,\eta^c_k + \lambda\,(M^{e \to c})^{\top} \eta^e_k \qquad (3)$$

where $M^{e \to c}$ is the analogous mapping matrix from $\Lambda^e$ to $\Lambda^c$. As a result, the incorporation of $\eta^e_k$ and $\eta^c_k$ on the topic level encourages the word distributions $\phi^e_k$ and $\phi^c_k$ to share common semantic components of reader comments in different languages.
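The construction of the mapping matrix and the mixture step can be illustrated with a minimal Python sketch. It assumes that the mixture takes the form phi_en = (1 - lambda) * eta_en + lambda * M^T eta_zh, which is what the surrounding definitions suggest, and that every Chinese term has at least one dictionary translation (otherwise its row would be all zeros). The function names, the toy vocabulary, and the counts are hypothetical.

```python
from collections import Counter


def build_translation_matrix(zh_vocab, en_vocab, dictionary, en_counts):
    """Row i holds P(English term | Chinese term i) with add-one smoothing.

    dictionary maps a Chinese term to a list of English translations;
    en_counts maps an English term to its count in the comments.
    """
    en_index = {w: j for j, w in enumerate(en_vocab)}
    matrix = []
    for zh in zh_vocab:
        row = [0.0] * len(en_vocab)
        translations = [w for w in dictionary.get(zh, []) if w in en_index]
        total = sum(en_counts.get(w, 0) + 1 for w in translations)
        for w in translations:
            row[en_index[w]] = (en_counts.get(w, 0) + 1) / total
        matrix.append(row)
    return matrix


def mix_topic(eta_en, eta_zh, m_zh_to_en, lam):
    """Assumed mixture: (1 - lam) * eta_en + lam * (M^T eta_zh)."""
    translated = [
        sum(eta_zh[i] * m_zh_to_en[i][j] for i in range(len(eta_zh)))
        for j in range(len(eta_en))
    ]
    return [(1 - lam) * e + lam * t for e, t in zip(eta_en, translated)]


# Toy data (hypothetical): two Chinese terms, three English terms.
en_vocab = ["plane", "aircraft", "ocean"]
zh_vocab = ["飞机", "海洋"]
dictionary = {"飞机": ["plane", "aircraft"], "海洋": ["ocean"]}
en_counts = Counter({"plane": 8, "ocean": 3})

M = build_translation_matrix(zh_vocab, en_vocab, dictionary, en_counts)
phi_en = mix_topic([0.5, 0.2, 0.3], [0.6, 0.4], M, lam=0.5)
```

Because each row of the matrix sums to 1, the translated vector is itself a distribution, so the mixture remains a valid word distribution without renormalization.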
The upper left plate in Figure 1 represents English reader comments. $N^e_d$ denotes the number of English reader comments and $N^e_{dw}$ denotes the number of words in the English comment $d^e$. Each English reader comment $d^e$ is characterized by a $K$-dimensional topic membership vector $\theta^e_d$, which is assumed to be generated by the prior $\mathrm{Dir}(\alpha \cdot \mathbf{1}_K)$. For each word $w^e_n$ in an English comment $d^e$, we generate the topic $z^e_n$ from $\theta^e_d$. We generate the word $w^e_n$ from the corresponding distribution $\phi^e_k$. The bottom left plate in Figure 1 represents Chinese reader comments. Similarly, we generate the topic distribution $\theta^c_d$ from the prior $\mathrm{Dir}(\alpha \cdot \mathbf{1}_K)$. The topic $z^c_n$ of each word $w^c_n$ in a Chinese comment $d^c$ is generated from $\theta^c_d$. We generate the word $w^c_n$ from the corresponding distribution $\phi^c_k$. The generative process is formally depicted as:
• For each topic $k \in \{1, \dots, K\}$
  - choose auxiliary distributions $\eta^e_k \sim \mathrm{Dir}(\beta \cdot \mathbf{1}_{|\Lambda^e|})$ and $\eta^c_k \sim \mathrm{Dir}(\beta \cdot \mathbf{1}_{|\Lambda^c|})$
  - choose word distributions $\phi^e_k$ and $\phi^c_k$ using Eq. 1 and Eq. 3 respectively.
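The per-comment part of the generative story described above (draw a topic membership vector, then a topic and a word for each position) can be sketched in Python as follows. This is only an illustration under stated assumptions: the topic word distributions are given as a fixed toy list rather than derived from the auxiliary distributions, words are represented by integer ids, and all function names are hypothetical.

```python
import random


def sample_dirichlet(alpha, dim, rng):
    """Symmetric Dirichlet draw via normalized Gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [x / total for x in draws]


def sample_categorical(probs, rng):
    """Draw one index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_comment(phi, alpha, length, rng):
    """Draw theta ~ Dir(alpha * 1_K), then a topic and a word per position."""
    theta = sample_dirichlet(alpha, len(phi), rng)
    topics, words = [], []
    for _ in range(length):
        k = sample_categorical(theta, rng)            # topic for this position
        topics.append(k)
        words.append(sample_categorical(phi[k], rng)) # word id from topic k
    return topics, words


rng = random.Random(0)
phi_toy = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]  # K = 2 topics, 3 word ids
topics, words = generate_comment(phi_toy, alpha=0.5, length=8, rng=rng)
```

The same routine applies to either language; only the word distributions passed in differ.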
Note that for simplicity, we present our model in the bilingual setting of Chinese and English. It can be extended to a multilingual setting by introducing auxiliary distributions for each language. The topic word distribution for each language is then generated as a convex combination of all the auxiliary distributions.

Posterior Inference
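In order to decouple the inference of $z_n$ and $\phi_k$ for each language, we develop a partially collapsed Gibbs sampling method which collapses only $\theta^e_d$ and $\theta^c_d$. Given $\phi^e_k$, we sample the new assignment of the topic $z^e_{di}$ in the English news reader comment $d^e$ with the following conditional probability:

$$P(z^e_{di} = k \mid z^{e,\neg i}, W^e, \alpha, \phi^e_k) \propto (N^{e,\neg i}_{dk} + \alpha_k) \times \phi^e_{k w^e_{di}} \qquad (4)$$

where $z^{e,\neg i}$ denotes the topic assignments except the assignment of the $i$th word, $N^e_{dk}$ is the number of words in English document $d^e$ whose topics are assigned to $k$, and $\phi^e_{k w^e_{di}}$ is the probability of the $i$th word under topic $k$. Similarly, we sample $z^c_{di}$ with the following equation:

$$P(z^c_{di} = k \mid z^{c,\neg i}, W^c, \alpha, \phi^c_k) \propto (N^{c,\neg i}_{dk} + \alpha_k) \times \phi^c_{k w^c_{di}} \qquad (5)$$

Given the topic assignments, the probability of the entire comment set can be written as:

$$P(W^e, W^c \mid z^e, z^c, \phi^e, \phi^c) = \prod_{k=1}^{K} \prod_{w \in \Lambda^e} (\phi^e_{kw})^{N^e_{kw}} \prod_{k=1}^{K} \prod_{w \in \Lambda^c} (\phi^c_{kw})^{N^c_{kw}} \qquad (6)$$

where $N^e_{kw}$ is the number of words $w$ in English news reader comments assigned to the topic $k$ and $N^c_{kw}$ is the number of words $w$ in Chinese news reader comments assigned to the topic $k$.
Algorithm 1 (final steps): update $\phi^e_k$ and $\phi^c_k$ according to Eq. 1 and Eq. 3; end for; output $\theta_{dk}$ by Eq. 10.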
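A single resampling step of Eq. 4 can be sketched in Python as below. This is a minimal illustration, not the full sampler: it assumes a symmetric prior (the same alpha for every topic), takes the current topic word distributions as given, and reads the last factor of Eq. 4 as the probability of the word at position i under topic k. The function name and the toy values are hypothetical.

```python
import random


def resample_topic(doc_topic_counts, current_topic, word_id, phi, alpha, rng):
    """Resample one word's topic with weight (N_dk^{-i} + alpha) * phi[k][word]."""
    doc_topic_counts[current_topic] -= 1  # exclude the word's current assignment
    weights = [
        (doc_topic_counts[k] + alpha) * phi[k][word_id]
        for k in range(len(phi))
    ]
    new_topic = rng.choices(range(len(phi)), weights=weights, k=1)[0]
    doc_topic_counts[new_topic] += 1      # record the new assignment
    return new_topic


# Toy case (hypothetical): word id 1 has zero probability under topic 0,
# so the step must move the word to topic 1.
rng = random.Random(0)
counts = [3, 2]
phi_toy = [[1.0, 0.0], [0.0, 1.0]]
new_topic = resample_topic(counts, current_topic=0, word_id=1,
                           phi=phi_toy, alpha=0.5, rng=rng)
```

Because the word distributions are held fixed during this step, the update only needs the per-document topic counts, which is what decouples the inference of the topic assignments from that of the word distributions.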
Using Eq. 6, we can obtain the posterior likelihood related to $\eta^e_k$ and $\eta^c_k$ (Eq. 7). We optimize Eq. 7 under the constraints $\sum_{w_i \in \Lambda^e} \eta^e_{k w_i} = 1$ and $\sum_{w_i \in \Lambda^c} \eta^c_{k w_i} = 1$. Using the fixed-point method, we obtain the update equations of $\eta^e_{k w_t}$ and $\eta^c_{k w_t}$ shown in Eq. 8 and Eq. 9.
Moreover, the posterior estimates for the topic distribution $\theta_d$ can be computed by Eq. 10.
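As an illustration, a smoothed estimate of the kind commonly used with collapsed samplers for LDA-style models is sketched below in Python. Whether the estimate used in this work has exactly this form is an assumption; the function name and the counts are hypothetical.

```python
def estimate_theta(doc_topic_counts, alpha):
    """Assumed estimate: theta_dk = (N_dk + alpha) / (N_d + K * alpha)."""
    total = sum(doc_topic_counts) + alpha * len(doc_topic_counts)
    return [(n + alpha) / total for n in doc_topic_counts]


# Toy counts (hypothetical): a four-word comment with K = 3 topics.
theta = estimate_theta([3, 1, 0], alpha=0.5)
```

The prior term keeps every topic at a non-zero proportion even when no word of the comment is assigned to it.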
The whole detailed algorithm is depicted in Algorithm 1. When $\lambda = 0$, the update equations of $\eta^e_k$ and $\eta^c_k$ simplify, and we have:

$$\phi^e_k \sim \mathrm{Dir}(N^e_{k w_1} + \beta, N^e_{k w_2} + \beta, \dots)$$
$$\phi^c_k \sim \mathrm{Dir}(N^c_{k w_1} + \beta, N^c_{k w_2} + \beta, \dots)$$

Therefore, the algorithm reduces to a Gibbs sampler of LDA.

Data Set and Preprocessing
We have prepared a data set by collecting English and Chinese comments from different regions reflecting different cultures for some significant events, as depicted in Table 1. The English reader comments are collected from Yahoo and the Chinese reader comments are collected from Sina News. We first remove news reader comments whose length is less than 5 words. We remove punctuation and stop words. For English comments, we also stem each word to its root

Comparative Methods
The PCLSA model can be regarded as the state-of-the-art model for detecting latent common topics from multilingual text documents. We implemented PCLSA as one of the comparative methods in our experiments.
Another comparative model used in the experiment is LDA (Blei et al., 2003), which can generate K English topics and K Chinese topics from English and Chinese reader comments respectively. Then we translate Chinese topics into English topics and use symmetric KL divergence to align translated Chinese topics with original English topics. Each aligned topic pair is regarded as a cultural-common topic.

Experiment Settings
For each event, we used a randomly selected 90% of the comments for the graphical model parameter estimation. The remaining 10% is used as holdout data for the evaluation of the CCP metric as discussed in Section 4.4.1. We repeated the runs five times. For each run, we randomly split the comments to obtain the holdout data. As a result, we have five runs for our method as well as the comparative methods. We make use of the holdout data of one event, namely the event "MH370 Flight Accident", to estimate the number of topics K for all models and λ in Eq. 1 for our model. The setting of K is described in Section 4.4.3. We set λ = 0.5 after tuning. For hyper-parameters, we set α to 0.5 and β to 0.01. When running our Gibbs algorithm, we set the maximum number of iterations to 1000 and the number of burn-in sweeps to 100.
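The repeated random 90%/10% partition can be sketched in Python as follows. This is a minimal illustration: the function name, the seed, and the toy comments are hypothetical, and the way languages are handled within a split (for example, splitting English and Chinese comments separately) is not specified here.

```python
import random


def holdout_splits(comments, runs=5, holdout_fraction=0.1, seed=0):
    """Yield (training, holdout) pairs, reshuffling the comments for every run."""
    rng = random.Random(seed)
    n_holdout = max(1, round(len(comments) * holdout_fraction))
    for _ in range(runs):
        shuffled = comments[:]
        rng.shuffle(shuffled)
        yield shuffled[n_holdout:], shuffled[:n_holdout]


# Toy input (hypothetical): 50 comments for one event.
splits = list(holdout_splits([f"comment {i}" for i in range(50)]))
```

Each run draws a fresh holdout set, so results averaged over the five runs are less sensitive to any single partition.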

Cultural-common Topic Evaluation
We conduct quantitative experiments to evaluate how well our MCTA model can discover cultural-common topics.

Evaluation Metrics
We use two metrics to evaluate the topic quality. The first metric is the "cross-collection perplexity" measure, denoted as CCP, which is similar to one used in prior work. The CCP of high-quality cultural-common topics should be lower than that of topics which are not shared by the English and Chinese reader comments. The calculation of CCP consists of two steps: 1) For each topic $k \in \{1, \dots, K\}$, we translate $\phi^e_k$ into a Chinese word distribution $T(\phi^e_k)$ and translate $\phi^c_k$ into an English word distribution $T(\phi^c_k)$. To translate $\phi^e_k$ and $\phi^c_k$, we look up the bilingual dictionary and conduct word-to-word translation. If one word has several translations, we distribute its probability mass equally among its translations. 2) We use $T(\phi^e_k)$ to fit the holdout Chinese comments C and $T(\phi^c_k)$ to fit the holdout English comments E using Eq. 13 (Blei et al., 2003). Eq. 13 depicts the calculation of CCP. The lower the CCP value is, the better the performance is.
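The two steps can be sketched in Python as below. This is a sketch under several assumptions that the text does not settle: the fit is measured with the standard perplexity exp(-sum of log-likelihoods / number of words), the topic proportions of each holdout comment are supplied by the caller, words without a dictionary entry simply lose their mass, and a small floor avoids log(0). All names and toy values are hypothetical.

```python
import math


def translate_distribution(phi, dictionary):
    """Split each source word's mass equally over its dictionary translations."""
    translated = {}
    for word, prob in phi.items():
        targets = dictionary.get(word, [])
        for target in targets:
            translated[target] = translated.get(target, 0.0) + prob / len(targets)
    return translated


def perplexity(docs, thetas, topics, floor=1e-12):
    """exp(-sum_d log p(w_d) / sum_d N_d), with p(w) = sum_k theta_dk * topic_k(w)."""
    log_lik, n_words = 0.0, 0
    for doc, theta in zip(docs, thetas):
        for word in doc:
            p = sum(t * topic.get(word, 0.0) for t, topic in zip(theta, topics))
            log_lik += math.log(max(p, floor))
            n_words += 1
    return math.exp(-log_lik / n_words)


# Toy case (hypothetical): one English topic translated into Chinese and
# scored on a single two-word holdout Chinese comment.
phi_en = {"plane": 0.6, "ocean": 0.4}
en_to_zh = {"plane": ["飞机", "航班"], "ocean": ["海洋"]}
translated = translate_distribution(phi_en, en_to_zh)
score = perplexity([["飞机", "海洋"]], [[1.0]], [translated])
```

Running the same procedure in the opposite direction (Chinese topics translated into English and scored on holdout English comments) gives the other half of the measure.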
For each detected common topic, we wish to evaluate the degree of commonality. We design another metric called "topic commonality distance", denoted by TCD. We first evaluate the KL-divergence between the English topic and the translated Chinese topic. We also evaluate the KL-divergence between the Chinese topic and the translated English topic. Then TCD is computed as the average of the two KL-divergences. The lower the TCD measure is, the better the topic is.
Table 2: Topic quality evaluation as measured by CCP
The topic detected by PCLSA is a mixture of English and Chinese words. We obtain the English representation and the Chinese representation of the topic by the conditional probabilities as given in Eq. 14.
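The TCD computation can be sketched in Python as follows. The direction of each KL-divergence and the small epsilon used to avoid division by zero are assumptions, as are the function names and the toy distributions; each pair of vectors is assumed to be aligned over the same vocabulary.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two aligned probability vectors, smoothed by eps."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def topic_commonality_distance(phi_en, translated_phi_zh, phi_zh, translated_phi_en):
    """Average of the two KL-divergences between original and translated topics."""
    return 0.5 * (kl_divergence(phi_en, translated_phi_zh)
                  + kl_divergence(phi_zh, translated_phi_en))


# Toy distributions (hypothetical).
tcd_same = topic_commonality_distance([0.5, 0.5], [0.5, 0.5], [0.3, 0.7], [0.3, 0.7])
tcd_diff = topic_commonality_distance([0.5, 0.5], [0.9, 0.1], [0.3, 0.7], [0.3, 0.7])
```

A topic whose two language-specific distributions agree after translation scores close to zero, while any mismatch in either direction raises the distance.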

Experimental Results
The average CCP values of the three models are shown in Table 2. Our MCTA model achieves the best performance compared with PCLSA and LDA. Both MCTA and PCLSA achieve a better CCP than LDA because they can bridge the language gap in the multilingual news reader comments to some extent. Compared with PCLSA, our MCTA model demonstrates a 4.2% improvement. Our MCTA model provides a better characterization of the collections. One reason is that our MCTA model learns the word distributions of cultural-common topics using effective topic modeling with a Dirichlet prior distribution. This is similar to the advantage of LDA over PLSA. Moreover, the bilingual constraints in PCLSA cannot handle the original natural word co-occurrence in each language well. In contrast, MCTA represents cultural-common topics as a mixture of the original topics and the translated topics, which capture the comment semantics more effectively. The average TCD values of the three models are shown in Table 3. Our MCTA outperforms the two comparative methods. The cultural-common topics iden-

Determining Number of Topics
As mentioned in Section 4.3, we use the holdout data of one event to determine K. For each λ ∈ {0.2, 0.5, 0.8}, we vary K in the range of [5, 200]. Figure 2 depicts the effect of K on the cross-collection perplexity as measured by CCP. We can see that CCP decreases as the number of topics increases. However, through manual inspection we observed that when K is 30 or more, the topics become repetitive even though CCP keeps decreasing. Similar observations for the number of topics can be found in Paul and Girju (2009). Therefore, we set K = 30. We can also see that our model is not very sensitive to the balance parameter λ.

Topic Coherence Evaluation
We also evaluate the coherence of topics generated by PCLSA and MCTA, which indicates the interpretability of topics. Following Newman et al. (2010), we use a pointwise mutual information (PMI) score to measure the topic coherence. We compute the average PMI score of the top 20 topic word pairs using Eq. 15. Newman et al. (2010) observed that it is important to use an external data set to evaluate PMI. Therefore, we use a 20-word sliding window in Wikipedia (Shaoul, 2010) to identify the co-occurrence of word pairs.
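A PMI-based coherence score of this kind can be sketched in Python as below, using the standard definition PMI(a, b) = log(P(a, b) / (P(a) P(b))) with probabilities estimated from sliding windows. The sketch averages over all pairs of the supplied top words, represents each window as a set of words, and adds a small epsilon for pairs that never co-occur; these choices, the function name, and the toy windows are assumptions.

```python
import math
from itertools import combinations


def average_pmi(top_words, windows, eps=1e-12):
    """Average PMI over all pairs of top_words; windows is a list of word sets."""
    n = len(windows)

    def prob(*words):
        # Fraction of windows that contain every word in `words`.
        return sum(1 for w in windows if all(x in w for x in words)) / n

    scores = [
        math.log((prob(a, b) + eps) / (prob(a) * prob(b) + eps))
        for a, b in combinations(top_words, 2)
    ]
    return sum(scores) / len(scores)


# Toy windows (hypothetical) standing in for 20-word windows of an external corpus.
toy_windows = [{"plane", "pilot"}, {"plane", "pilot"}, {"ocean"}, {"ocean", "search"}]
coherence = average_pmi(["plane", "pilot"], toy_windows)
```

Words that co-occur more often than chance in the external corpus yield a positive score, which is why a higher average indicates a more interpretable topic.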
The experimental results are shown in Table 4. We can see that our MCTA model generally improves the coherence of the learned topics compared with PCLSA. The word-to-word bilingual constraints in PCLSA are not as effective. On the other hand, our MCTA model incorporates the bilingual translations using auxiliary distributions, which incorporate word distributions from the other language on the topic level and can capture common semantics of multilingual reader comments.

Application and Case Study
We present an application for news comment digest and show some examples of detected cultural-common discussion topics in Figure 3. Under each event, the system can group reader comments into cultural-common discussion topics which can capture common concerns of readers in different languages. For each common topic, it shows top ranked words and corresponding reader comments.