A Context-Aware Topic Model for Statistical Machine Translation

Lexical selection is crucial for statistical machine translation. Previous studies separately exploit sentence-level contexts and document-level topics for lexical selection, neglecting their correlations. In this paper, we propose a context-aware topic model for lexical selection, which not only models local contexts and global topics but also captures their correlations. The model uses target-side translations as hidden variables to connect document topics and source-side local contextual words. In order to learn hidden variables and distributions from data, we introduce a Gibbs sampling algorithm for statistical estimation and inference. A new translation probability based on distributions learned by the model is integrated into a translation system for lexical selection. Experiment results on NIST Chinese-English test sets demonstrate that 1) our model significantly outperforms previous lexical selection methods and 2) modeling correlations between local words and global topics can further improve translation quality.


Introduction
Lexical selection is a very important task in statistical machine translation (SMT). Given a sentence in the source language, lexical selection statistically predicts translations for source words, based on various translation knowledge. Most conventional SMT systems (Koehn et al., 2003; Galley et al., 2006; Chiang, 2007) [...] Previous studies that explore richer information for lexical selection can be divided into two categories: 1) incorporating sentence-level contexts (Chan et al., 2007; Carpuat and Wu, 2007; Hasan et al., 2008; Mauser et al., 2009; Shen et al., 2009) or 2) integrating document-level topics (Xiao et al., 2011; Ture et al., 2012; Xiao et al., 2012; Eidelman et al., 2012; Hewavitharana et al., 2013; Xiong et al., 2013; Hasler et al., 2014a; Hasler et al., 2014b) into SMT. The methods in these two strands have shown their effectiveness on lexical selection.
However, correlations between sentence- and document-level contexts have never been explored before. It is clear that local contexts and global topics are often highly correlated. Consider the Chinese-English translation example presented in Figure 1. On the one hand, if local contexts suggest that the source word "立场/lìchǎng" should be translated into "stance", they will also indicate that the topic of the document where the example sentence occurs is politics. The politics topic can be further used to enable the decoder to select the correct translation "issue" for another source word "问题/wèntí", which is consistent with this topic. On the other hand, if we know that this document mainly focuses on the politics topic, the candidate translation "stance" will be more compatible with the context of "立场/lìchǎng" than the candidate translation "attitude". This is because the neighboring source-side words "中国/zhōngguó" and "中立/zhōnglì" often occur in documents that are about international politics. We believe that such correlations between local contextual words and global topics can be used to further improve lexical selection.
In this paper, we propose a unified framework to jointly model local contexts, global topics and their correlations for lexical selection. Specifically,
• First, we present a context-aware topic model (CATM) to exploit the features mentioned above for lexical selection in SMT. To the best of our knowledge, this is the first work to jointly model both local and global contexts for lexical selection in a topic model.
• Second, we present a Gibbs sampling algorithm to learn various distributions that are related to topics and translations from data. The translation probabilities derived from our model are integrated into SMT to allow collective lexical selection with both local and global information.
We validate the effectiveness of our model on a state-of-the-art phrase-based translation system. Experiment results on the NIST Chinese-English translation task show that our model significantly outperforms previous lexical selection methods.

Context-Aware Topic Model
In this section, we describe basic assumptions and elaborate the proposed context-aware topic model.

Basic Assumptions
In CATM, we assume that each source document d consists of two types of words: topical words, which are related to the topics of the document, and contextual words, which affect the translation selection of topical words. As the topics of a document are usually represented by its content words, we choose source-side nouns, verbs, adjectives and adverbs as topical words. As contextual words, we use all words in a source sentence. We assume that contextual words are generated by the target-side translations of words other than themselves. Note that a source word may be both topical and contextual. For each topical word, we identify its candidate translations from the training corpus according to word alignments between the source and target language. We allow a target translation to be a phrase of no more than 3 words. We refer to these translations of source topical words as target-side topical items, which can be either words or phrases. In the example shown in Figure 1, all source words within dotted boxes are topical words. The topical word "立场/lìchǎng" is supposed to be translated into the target-side topical item "stance", which is collectively suggested by the neighboring contextual words "中国/zhōngguó" and "中立/zhōnglì" and by the topic of the corresponding document.
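The word typing described above can be illustrated with a minimal sketch. The coarse tag names, the romanized toy tokens and the helper names `split_words` and `filter_candidates` are assumptions made for illustration; the paper does not specify a tag set or an implementation.

```python
# Hypothetical coarse PoS tags for content words; the actual tag set
# depends on the tagger and is not given in the paper.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}

def split_words(tagged_sentence):
    """Return (topical, contextual) word lists for one source sentence.

    tagged_sentence: list of (word, pos_tag) pairs.
    Topical words are the content words; every word is a contextual word,
    so a word can appear in both lists.
    """
    topical = [w for w, tag in tagged_sentence if tag in CONTENT_TAGS]
    contextual = [w for w, _ in tagged_sentence]
    return topical, contextual

def filter_candidates(aligned_translations, max_len=3):
    """Keep candidate translations of at most max_len target words."""
    return [t for t in aligned_translations if len(t.split()) <= max_len]

# Toy sentence with romanized tokens (illustrative only).
sentence = [("zhongguo", "NOUN"), ("de", "PART"),
            ("zhongli", "ADJ"), ("lichang", "NOUN")]
topical, contextual = split_words(sentence)
candidates = filter_candidates(["stance", "point of view", "a point of view"])
```

Here the function word "de" is contextual only, while the three content words are both topical and contextual.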
In our model, all target-side topical items in a document are generated according to the following two assumptions:
• Topic consistency assumption: All target-side topical items in a document should be consistent with the topic distribution of the document. For example, the translations "issue" and "stance" tend to occur in documents about politics.
• Context compatibility assumption: For a topical word, its translation (i.e., the counterpart target-side topical item) should be compatible with its neighboring contextual words. For instance, the translation "stance" of "立场/lìchǎng" is closely related to the contextual words "中国/zhōngguó" and "中立/zhōnglì".

Model
The graphical representation of CATM, which visualizes the generative process of training data D, is shown in Figure 2. Notations of CATM are presented in [...] topical words "问题/wèntí", "立场/lìchǎng", and contextual word "中立/zhōnglì" in the following steps:
Step 1: The model generates a topic distribution for the corresponding document as {economy: 0.25, politics: 0.75}.
Step 2: Based on the topic distribution, we choose "economy" and "politics" as topic assignments for "问题/wèntí" and "立场/lìchǎng", respectively. Then, according to the distributions of the two topics over target-side topical items, we generate the target-side topical items "issue" and "stance". Finally, according to the translation probability distributions of these two topical items over source-side topical words, we generate the source-side topical words "问题/wèntí" and "立场/lìchǎng" for them, respectively.
In the above generative process, all target-side topical items are generated from the underlying topics of a source document, which guarantees that selected target translations are topic-consistent. Additionally, each source contextual word is derived from a target-side topical item given its generation probability distribution. This makes selected target translations also compatible with source-side local contextual words. In this way, global topics, topical words, local contextual words and target-side topical items are highly correlated in CATM, which captures exactly these correlations for lexical selection.
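The generative story can be sketched as a small simulation. This is a simplified illustration, not the authors' implementation: the toy distributions `PHI` (topic to item), `PSI` (item to topical word) and `XI` (item to contextual word) are invented, the tokens are romanized placeholders, and the item that generates each contextual word is picked uniformly from the document's items, which is an assumption standing in for the window-based choice in the model.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw a probability vector from Dirichlet(alpha) via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def draw(dist, rng):
    """Draw one outcome from a dict mapping outcome -> probability."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]

def generate_document(n_topical, n_contextual, alpha, phi, psi, xi, rng):
    """Generate (topic distribution, topical triples, contextual pairs)."""
    topics = list(phi)
    theta = dict(zip(topics, sample_dirichlet([alpha] * len(topics), rng)))
    topical = []
    for _ in range(n_topical):
        z = draw(theta, rng)    # topic assignment
        e = draw(phi[z], rng)   # target-side topical item from the topic
        f = draw(psi[e], rng)   # source topical word from the item
        topical.append((z, e, f))
    contextual = []
    for _ in range(n_contextual):
        _, e, _ = rng.choice(topical)  # assumption: uniform choice of item
        c = draw(xi[e], rng)           # contextual word from the item
        contextual.append((e, c))
    return theta, topical, contextual

# Invented toy distributions (hypothetical values).
PHI = {"economy": {"issue": 0.7, "market": 0.3},
       "politics": {"stance": 0.6, "issue": 0.4}}
PSI = {"issue": {"wenti": 1.0}, "market": {"shichang": 1.0},
       "stance": {"lichang": 1.0}}
XI = {"issue": {"jiejue": 0.5, "taolun": 0.5}, "market": {"jiage": 1.0},
      "stance": {"zhongguo": 0.5, "zhongli": 0.5}}

rng = random.Random(0)
theta, topical, contextual = generate_document(2, 3, 2.0, PHI, PSI, XI, rng)
```

Because every contextual word is drawn from an item that was itself drawn from a document topic, the topic, the translation and the local context are tied together through the item variable.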

Parameter Estimation and Inference
We propose a Gibbs sampling algorithm to learn various distributions described in the previous section. Details of the learning and inference process are presented in this section.

The Probability of Training Corpus
According to CATM, the total probability of training data D given hyperparameters α, β, γ and δ is computed as in Eq. (1), where $\mathbf{f}_d$ and $\tilde{\mathbf{e}}_d$ denote the sets of topical words and their target-side topical item assignments in document d, and $\mathbf{c}_d$ and $\tilde{\mathbf{e}}'_d$ are the sets of contextual words and their target-side topical item assignments in document d.

Parameter Estimation via Gibbs Sampling
The joint distribution in Eq. (1) is intractable to compute because of coupled hyperparameters and hidden variables. Following Han et al. (2012), we adapt the well-known Gibbs sampling algorithm (Griffiths and Steyvers, 2004) to our model. We compute the joint posterior distribution of hidden variables, denoted by $P(\mathbf{z}, \tilde{\mathbf{e}}, \tilde{\mathbf{e}}' \mid D)$, and then use this distribution to 1) estimate θ, φ, ψ and ξ, and 2) predict translations and topics of all documents in D.
Specifically, we derive the joint posterior distribution from Eq. (1) as:

$$P(\mathbf{z}, \tilde{\mathbf{e}}, \tilde{\mathbf{e}}' \mid D) \propto P(\mathbf{z})\,P(\tilde{\mathbf{e}} \mid \mathbf{z})\,P(\mathbf{f} \mid \tilde{\mathbf{e}})\,P(\tilde{\mathbf{e}}' \mid \tilde{\mathbf{e}})\,P(\mathbf{c} \mid \tilde{\mathbf{e}}') \qquad (2)$$

Based on the equation above, we construct a Markov chain that converges to $P(\mathbf{z}, \tilde{\mathbf{e}}, \tilde{\mathbf{e}}' \mid D)$, where each state is an assignment of a hidden variable (a topic assignment to a topical word, or a target-side topical item assignment to a source topical or contextual word). Then, we sequentially sample each assignment according to the following three conditional assignment distributions:
1. $P(z_i = z \mid \mathbf{z}_{-i}, \tilde{\mathbf{e}}, \tilde{\mathbf{e}}', D)$: the topic assignment distribution of a topical word given $\mathbf{z}_{-i}$, which denotes all topic assignments except $z_i$, and $\tilde{\mathbf{e}}$ and $\tilde{\mathbf{e}}'$, which are the target-side topical item assignments. It is updated as in Eq. (3), where the topic assignment to a topical word is determined by the probability that this topic appears in document d (the 1st term) and the probability that the selected item $\tilde{e}$ occurs in this topic (the 2nd term).
2. $P(\tilde{e}_i = \tilde{e} \mid \mathbf{z}, \tilde{\mathbf{e}}_{-i}, \tilde{\mathbf{e}}', D)$: the target-side topical item assignment distribution of a source topical word given the current topic assignments $\mathbf{z}$, the current item assignments of all other topical words $\tilde{\mathbf{e}}_{-i}$, and the current item assignments of contextual words $\tilde{\mathbf{e}}'$. It is updated as in Eq. (4), where the target-side topical item assignment to a topical word is determined by the probability that this item is from the topic z (the 1st term), the probability that this item is translated into the topical word f (the 2nd term) and the probability of the contextual words within a $w_s$-word window centered at the topical word f, which influence the selection of the target-side topical item $\tilde{e}$ (the 3rd term). It is important to note that we use a parallel corpus to train the model. Therefore, during training we directly identify target-side topical items for source topical words via word alignments rather than by sampling.
3. $P(\tilde{e}'_i = \tilde{e} \mid \mathbf{z}, \tilde{\mathbf{e}}, \tilde{\mathbf{e}}'_{-i}, D)$: the target-side topical item assignment distribution for a contextual word given the current topic assignments $\mathbf{z}$, the current item assignments of topical words $\tilde{\mathbf{e}}$, and the current item assignments of all other contextual words $\tilde{\mathbf{e}}'_{-i}$. It is updated as in Eq. (5), where the target-side topical item assignment used to generate a contextual word is determined by the probability of this item being assigned to generate contextual words within a surface window of size $w_s$ (the 1st term) and the probability that contextual words occur in the context of this item (the 2nd term).
In all the above formulas, $C^{DZ}_{dz}$ is the number of times that topic z has been assigned to topical words in document d, $C^{DZ}_{d*} = \sum_z C^{DZ}_{dz}$ is the total number of topic assignments in document d, and $C^{Z\tilde{E}}_{z\tilde{e}}$, $C^{\tilde{E}F}_{\tilde{e}f}$, $C^{W\tilde{E}}_{w\tilde{e}}$, $C^{W\tilde{E}'}_{w\tilde{e}'}$ and $C^{\tilde{E}C}_{\tilde{e}c}$ have similar explanations. Based on the above conditional distributions, we iteratively update all assignments of corpus D until the constructed Markov chain converges. Model parameters are estimated using these final assignments.
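To make the first sampling step concrete, the sketch below computes a topic conditional for one topical word from count tables. It assumes the standard collapsed-Gibbs form for LDA-style models, (document-topic count + α) multiplied by (topic-item count + β) / (topic total + N·β); this matches the two terms described in the text but is not guaranteed to be the paper's exact Eq. (3). The function name, the count-table layout and the toy counts are hypothetical.

```python
from collections import defaultdict

def topic_conditional(doc, item, topics, c_dz, c_ze, c_z, alpha, beta, n_items):
    """Normalised P(z_i = z | everything else) for one topical word.

    The counts must already exclude the word's current assignment.
    c_dz[(doc, z)]: times topic z is assigned in document doc.
    c_ze[(z, item)]: times the target-side item is assigned to topic z.
    c_z[z]: total item assignments to topic z.
    The document-side denominator is the same for every z, so it is omitted.
    """
    weights = []
    for z in topics:
        doc_term = c_dz[(doc, z)] + alpha
        item_term = (c_ze[(z, item)] + beta) / (c_z[z] + n_items * beta)
        weights.append(doc_term * item_term)
    total = sum(weights)
    return {z: w / total for z, w in zip(topics, weights)}

# Toy counts (invented for illustration).
topics = ["economy", "politics"]
c_dz = defaultdict(int, {(0, "economy"): 1, (0, "politics"): 3})
c_ze = defaultdict(int, {("politics", "stance"): 4})
c_z = defaultdict(int, {"economy": 5, "politics": 5})
probs = topic_conditional(0, "stance", topics, c_dz, c_ze, c_z,
                          alpha=2.0, beta=0.1, n_items=10)
```

In a full sampler, one would decrement the counts for the current assignment, draw a new topic from `probs`, and increment the counts again; the item updates for topical and contextual words follow the same remove, score, resample pattern with their own terms.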

Inference on Unseen Documents
For a new document, we first predict its topics and target-side topical items using the incremental Gibbs sampling algorithm described by Kataria et al. (2011). In this algorithm, we iteratively update the topic assignments and translation assignments of an unseen document following the same process described in Section 3.2, but with the estimated model parameters held fixed.
Once we obtain these assignments, we estimate lexical translation probabilities based on the sampled counts of target-side topical items. Formally, for the position i in the document corresponding to the content word f, we collect the sampled count that translation $\tilde{e}$ generates f, denoted by $C_{sam}(\tilde{e}, f)$. This count can be normalized to form a new translation probability in the following way:

$$p(\tilde{e} \mid f) = \frac{C_{sam}(\tilde{e}, f) + k}{C_{sam} + k \cdot N_{\tilde{e},f}} \qquad (6)$$

where $C_{sam}$ is the total number of samples during inference and $N_{\tilde{e},f}$ is the number of candidate translations of f. Here we apply add-k smoothing to refine this translation probability, where k is a tunable global smoothing constant. Under the framework of the log-linear model (Och and Ney, 2002), we use this translation probability as a new feature to improve lexical selection in SMT.
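The add-k normalization of sampled counts can be sketched in a few lines. The function name and the toy counts are hypothetical; the sketch assumes that `sample_counts` lists every candidate translation of the word, including those that were never sampled (count 0), so that its length equals the number of candidates.

```python
def smoothed_translation_prob(sample_counts, n_samples, k=0.5):
    """Add-k smoothed translation probabilities for one source position.

    sample_counts: dict mapping each candidate translation to the number of
        collected samples in which it was assigned to the source word.
    n_samples: total number of samples collected during inference.
    k: global smoothing constant.
    """
    n_candidates = len(sample_counts)
    denom = n_samples + k * n_candidates
    return {e: (count + k) / denom for e, count in sample_counts.items()}

# Toy counts over 300 collected samples (invented for illustration).
counts = {"stance": 240, "attitude": 50, "position": 10}
probs = smoothed_translation_prob(counts, n_samples=300, k=0.5)
```

When the counts sum to the number of samples, the smoothed values sum to one, and a candidate that was never sampled still receives a small non-zero probability.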

Experiments
In order to examine the effectiveness of our model, we carried out several groups of experiments on Chinese-to-English translation.

Setup
Our bilingual training corpus is from the FBIS corpus and the Hansards part of the LDC2004T07 corpus (1M parallel sentences, 54.6K documents, with 25.2M Chinese words and 29M English words). We first used the ZPar toolkit and the Stanford toolkit to preprocess (i.e., word segmentation and PoS tagging) the Chinese and English parts of the training corpus, and then word-aligned them using GIZA++ (Och and Ney, 2003) with the option "grow-diag-final-and". We chose the NIST evaluation set of MT05 as the development set, and the sets of MT06/MT08 as test sets. On average, these three sets contain 17.2, 13.9 and 14.1 content words per sentence, respectively. We trained a 5-gram language model on the Xinhua portion of the Gigaword corpus using the SRILM Toolkit (Stolcke, 2002).
Our baseline system is a state-of-the-art SMT system, which adapts bracketing transduction grammars (Wu, 1997) to phrasal translation and equips itself with a maximum entropy based reordering model (MEBTG) (Xiong et al., 2006). We used the toolkit developed by Zhang (2004) to train the reordering model with the following parameters: iteration number iter=200 and Gaussian prior g=1.0. During decoding, we set the ttable-limit to 20 and the stack-size to 100. The translation quality is evaluated by the case-insensitive BLEU-4 metric (Papineni et al., 2002). Finally, we conducted paired bootstrap sampling (Koehn, 2004) to test the significance of BLEU score differences.
To train CATM, we set the topic number $N_z$ to 25. For hyperparameters α and β, we empirically set α=50/$N_z$ and β=0.1, as implemented in (Griffiths and Steyvers, 2004). Following Han et al. (2012), we set γ and δ to 1.0/$N_f$ and 2000/$N_c$, respectively. During the training process, we ran 400 iterations of the Gibbs sampling algorithm. For documents to be translated, we first ran 300 rounds in a burn-in step to let the probability distributions converge, and then ran 1500 rounds where we collected independent samples every 5 rounds. The longest training time of CATM is less than four days on our server using 4GB RAM and one core of a 3.2GHz CPU. As for the smoothing constant k in Eq. (6), we set its value to 0.5 according to the performance on the development set in additional experiments.
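As a worked check of the inference schedule, the following sketch enumerates the rounds at which samples would be collected. It assumes that collection starts after the burn-in and happens at every fifth subsequent round; the exact rounds are not stated in the text, and the function name is hypothetical.

```python
def collection_rounds(burn_in=300, sampling_rounds=1500, lag=5):
    """Return the (1-based) rounds at which a sample is collected.

    Assumption: after burn_in rounds, one sample is kept every lag rounds
    for sampling_rounds further rounds.
    """
    first = burn_in + 1
    last = burn_in + sampling_rounds
    return [r for r in range(first, last + 1) if (r - burn_in) % lag == 0]

rounds = collection_rounds()

# Document-topic prior for 25 topics, using alpha = 50 / N_z from the setup.
n_topics = 25
alpha = 50 / n_topics
```

Under this reading, 1500 / 5 = 300 samples are collected per document, which would be the total sample count used when normalizing sampled translation counts, and α evaluates to 2.0.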

Impact of Window Size w s
Our first group of experiments was conducted on the development set to investigate the impact of the window size $w_s$. We gradually varied the window size from 6 to 14 with an increment of 2.
Experiment results are shown in Table 2. We achieve the best performance when $w_s$=12. This suggests that a ±12-word window context is sufficient for predicting target-side translations for ambiguous source-side topical words. We therefore set $w_s$=12 for all experiments thereafter.

Overall Performance
In the second group of experiments, in addition to the conventional MEBTG system, we also compared CATM with the following two models:
• Word Sense Disambiguation Model (WSDM) (Chan et al., 2007). This model improves lexical selection in SMT by exploiting local contexts. For each content word, we construct a MaxEnt-based classifier incorporating local collocation and surrounding word features, which are also adopted by Chan et al. (2007). For each candidate translation $\tilde{e}$ of topical word f, we use WSDM to estimate the context-specific translation probability $P(\tilde{e} \mid f)$, which is used as a new feature in the SMT system.
• Topic-specific Lexicon Translation Model (TLTM) (Zhao and Xing, 2007). This model focuses on the utilization of document-level context. We adapted it to estimate a lexicon translation probability as follows:

$$p(\tilde{e} \mid f, d) \propto \sum_{z} p(\tilde{e} \mid f, z)\, p(f \mid z)\, p(z \mid d)$$

where $p(\tilde{e} \mid f, z)$ is the lexical translation probability conditioned on topic z, which can be calculated according to the principle of maximum likelihood, $p(f \mid z)$ is the generation probability of word f from topic z, and $p(z \mid d)$ denotes the posterior topic distribution of document d.
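A minimal sketch of the TLTM estimate follows. It assumes the mixture form implied by the three quantities defined above, i.e. a sum over topics of p(ẽ|f,z)·p(f|z)·p(z|d), normalized over the candidates. The function name, the dictionary layout and all probabilities are invented for illustration.

```python
def tltm_prob(candidates, f, doc_topics, p_e_given_fz, p_f_given_z):
    """Topic-mixture lexicon translation probabilities for source word f.

    doc_topics: dict topic -> p(z | d).
    p_e_given_fz: dict (e, f, z) -> p(e | f, z).
    p_f_given_z: dict (f, z) -> p(f | z).
    Missing entries are treated as probability 0.
    """
    scores = {}
    for e in candidates:
        scores[e] = sum(
            p_e_given_fz.get((e, f, z), 0.0) * p_f_given_z.get((f, z), 0.0) * p_z
            for z, p_z in doc_topics.items()
        )
    total = sum(scores.values())
    if total == 0:
        return scores
    return {e: s / total for e, s in scores.items()}

# Toy tables (hypothetical values).
doc_topics = {"politics": 0.75, "economy": 0.25}
p_f_given_z = {("lichang", "politics"): 0.02, ("lichang", "economy"): 0.01}
p_e_given_fz = {
    ("stance", "lichang", "politics"): 0.8,
    ("attitude", "lichang", "politics"): 0.2,
    ("stance", "lichang", "economy"): 0.4,
    ("attitude", "lichang", "economy"): 0.6,
}
probs = tltm_prob(["stance", "attitude"], "lichang", doc_topics,
                  p_e_given_fz, p_f_given_z)
```

With a document dominated by the politics topic, "stance" receives the larger share; note that this baseline sees the document's topics but none of the neighboring words.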
Note that our CATM is proposed for lexical selection on content words. To show the effectiveness of our model, we also compared it against the full-fledged variants of the above-mentioned two models that are built for all source words. We refer to them as WSDM (All) and TLTM (All), respectively. Table 3 displays BLEU scores of different lexical selection models. All models outperform the baseline. Although we only use CATM to predict translations for content words, CATM achieves an average BLEU score of 26.77 on the two test sets, which is higher than that of the baseline by 1.18 BLEU points. This improvement is statistically significant at p<0.01. Furthermore, we also find that our model performs significantly better than WSDM and TLTM. Finally, even though WSDM (All) and TLTM (All) are built for all source words, they are still no better than CATM, which selects translations for content words only. These experiment results strongly demonstrate the advantage of CATM over previous lexical selection models.

Analysis
In order to investigate why CATM is able to outperform previous models that explore only local contextual words or global topics, we take a closer look at the topics, topical items and contextual words learned by CATM and empirically analyze the effect of modeling correlations between local contextual words and global topics on lexical selection.
Table 3: Experiment results on the test sets. Avg = average BLEU scores. WSDM (All) and TLTM (All) are models built for all source words. ↓: significantly worse than CATM (p<0.05), ↓↓: significantly worse than CATM (p<0.01).

Outputs of CATM
We present some examples of topics learned by CATM in Table 4. We also list five target-side topical items with the highest probabilities for each topic, and the most probable five contextual words for each target-side topical item. These examples clearly show that target-side topical items tightly connect global topics and local contextual words by capturing their correlations.

Effect of Correlation Modeling
Compared to previous lexical selection models, CATM jointly models both local contextual words and global topics. Such joint modeling also enables CATM to capture their inner correlations at the model level. In order to examine the effect of correlation modeling on lexical selection, we compared CATM with its three variants:
• CATM (Context), which only uses local context information. We determined target-side topical items for content words in this variant by setting the probability distribution that a topic generates a target-side topical item to be uniform.
• CATM (Topic), which explores only global topic information. We identified target-side topical items for content words in this variant by setting $w_s$ to 0, i.e., no local contextual words are used at all.
• CATM (Log-linear), which combines the above-mentioned two variants in a log-linear manner and therefore does not capture correlations between local contextual words and global topics at the model level.
Results in Table 5 show that CATM performs significantly better than both CATM (Topic) and CATM (Context). Even compared with CATM (Log-linear), CATM still achieves a significant improvement of 0.35 BLEU points (p<0.05). This validates the effectiveness of capturing correlations for lexical selection at the model level.
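To illustrate what combining two variants "in a log-linear manner" means, the sketch below scores each candidate by a weighted sum of the log-probabilities from a context-only and a topic-only model. The weights, function names and probabilities are hypothetical; in a real system the two probabilities would be separate decoder features whose weights are tuned on a development set.

```python
import math

def log_linear_score(p_context, p_topic, w_context=1.0, w_topic=1.0):
    """Weighted sum of log-probabilities from two independent models."""
    return w_context * math.log(p_context) + w_topic * math.log(p_topic)

def rank_candidates(context_probs, topic_probs, w_context=1.0, w_topic=1.0):
    """Rank candidate translations by their combined log-linear score."""
    scored = {
        e: log_linear_score(context_probs[e], topic_probs[e], w_context, w_topic)
        for e in context_probs
    }
    return sorted(scored, key=scored.get, reverse=True)

# Toy probabilities from the two variants (invented for illustration).
context_probs = {"stance": 0.6, "attitude": 0.4}
topic_probs = {"stance": 0.7, "attitude": 0.3}
ranking = rank_candidates(context_probs, topic_probs)
```

Each model here is estimated without seeing the other's evidence, so the combination only interacts at scoring time; this is the difference from the joint model, where the shared item variable couples topics and contextual words during estimation.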

Related Work
Our work is partially inspired by (Han and Sun, 2012), where an entity-topic model is presented for entity linking. We successfully adapt this work to lexical selection in SMT. The related work mainly includes the following two strands.
(1) Lexical Selection in SMT. In order to explore rich context information for lexical selection, some researchers propose trigger-based lexicon models to capture long-distance dependencies (Hasan et al., 2008; Mauser et al., 2009), and many more researchers build classifiers to select desirable translations during decoding (Chan et al., 2007; Carpuat and Wu, 2007). Along this line, Shen et al. (2009)

Note to Table 4: Here "枚" and "场" are Chinese quantifiers for missile and war, respectively; "两" and "岸" together mean cross-strait.
the document-level translation consistency. Ture et al. (2012) soften this consistency constraint by integrating three counting features into the decoder. Also relevant is the work of Xiong et al. (2013), who use three different models to capture lexical cohesion for document-level SMT.
(2) SMT with Topic Models. In this strand, Zhao and Xing (2006, 2007) first present a bilingual topical admixture formalism for word alignment in SMT. Tam et al. (2007) and Ruiz et al. (2012) apply topic models to language model adaptation. Su et al. (2012) conduct translation model adaptation with monolingual topic information. Gong et al. (2010) and Xiao et al. (2012) introduce topic-based similarity models to improve SMT systems. Axelrod et al. (2012) build topic-specific translation models from the TED corpus and select topic-relevant data from the UN corpus to improve coverage. Eidelman et al. (2012) incorporate topic-specific lexical weights into the translation model. Hewavitharana et al. (2013) propose an incremental topic-based translation model adaptation approach that satisfies the causality constraint imposed by spoken conversations. Hasler et al. (2014a) present a new bilingual variant of LDA to compute topic-adapted, probabilistic phrase translation features. They also use a topic model to learn latent distributional representations of different context levels of a phrase pair (Hasler et al., 2014b).
Among the studies mentioned above, those by Zhao and Xing (2006), Zhao and Xing (2007), Hasler et al. (2014a), and Hasler et al. (2014b) are most related to our work. However, they all perform dynamic translation model adaptation with topic models. In contrast, we propose a new topic model that exploits both local contextual words and global topics for lexical selection. To the best of our knowledge, this is the first attempt to capture correlations between local words and global topics for better lexical selection at the model level.

Conclusion and Future Work
This paper has presented a novel context-aware topic model for lexical selection in SMT. Jointly modeling local contexts, global topics and their correlations in a unified framework, our model provides an effective way to capture context information at different levels for better lexical selection in SMT. Experiment results not only demonstrate the effectiveness of the proposed topic model, but also show that lexical selection benefits from correlation modeling.
In the future, we want to extend our model from the word level to the phrase level. We also plan to improve our model with monolingual corpora.