Agreement-based Learning of Parallel Lexicons and Phrases from Non-Parallel Corpora

We introduce an agreement-based approach to learning parallel lexicons and phrases from non-parallel corpora. The basic idea is to encourage two asymmetric latent-variable translation models (i.e., source-to-target and target-to-source) to agree on identifying latent phrase and word alignments. The agreement is defined at both word and phrase levels. We develop a Viterbi EM algorithm for jointly training the two unidirectional models efficiently. Experiments on the Chinese-English dataset show that agreement-based learning significantly improves both alignment and translation performance.


Introduction
Parallel corpora, which are large collections of parallel texts, serve as an important resource for inducing translation correspondences, either at the level of words (Brown et al., 1993; Smadja and McKeown, 1994; Wu and Xia, 1994) or phrases (Kupiec, 1993; Melamed, 1997; Marcu and Wong, 2002; Koehn et al., 2003). However, the availability of large-scale, wide-coverage parallel corpora remains a challenge even in the era of big data: parallel corpora usually exist only for resource-rich languages and are restricted to limited domains such as government documents and news articles.
Recently, a number of authors have turned to a more challenging task: learning parallel phrases from non-parallel corpora (Zhang and Zong, 2013; Dong et al., 2015). Zhang and Zong (2013) present a method for retrieving parallel phrases from non-parallel corpora using a seed parallel lexicon. Dong et al. (2015) continue this line of research and introduce an iterative approach to joint learning of parallel lexicons and phrases. They introduce a corpus-level latent-variable translation model in a non-parallel scenario and develop a training algorithm that alternates between (1) using a parallel lexicon to extract parallel phrases from non-parallel corpora and (2) using the extracted parallel phrases to enlarge the parallel lexicon. They show that starting from a small seed lexicon, their approach is capable of learning both new words and phrases gradually over time.
However, due to the structural divergence between natural languages as well as the presence of noisy data, using only asymmetric translation models might be insufficient to accurately identify parallel lexicons and phrases from non-parallel corpora. Dong et al. (2015) report that the accuracy on the Chinese-English dataset is only around 40% after running for 70 iterations. In addition, their approach seems sensitive to noisy data in non-parallel corpora, as the accuracy drops significantly as the amount of noise increases.
Since asymmetric word alignment and phrase alignment models are usually complementary, it is natural to combine them to make more accurate predictions. In this work, we propose to introduce agreement-based learning (Liang et al., 2006; Liang et al., 2008) into extracting parallel lexicons and phrases from non-parallel corpora. Based on the latent-variable model proposed by Dong et al. (2015), we propose two kinds of loss functions to take into account the agreement between both phrase alignment and word alignment in two directions. As the inference is intractable, we resort to a Viterbi EM algorithm to train the two models efficiently. Experiments on the Chinese-English dataset show that agreement-based learning is more robust to noisy data and leads to substantial improvements in phrase alignment and machine translation evaluations.

Background
Given a monolingual corpus of source language phrases $E = \{e^{(s)}\}_{s=1}^{S}$ and a monolingual corpus of target language phrases $F = \{f^{(t)}\}_{t=1}^{T}$, we assume there exists a parallel corpus $D = \{\langle e^{(s)}, f^{(t)} \rangle \mid e^{(s)} \leftrightarrow f^{(t)}\}$, where $e^{(s)} \leftrightarrow f^{(t)}$ denotes that $e^{(s)}$ and $f^{(t)}$ are translations of each other.
As a long sentence in E is usually unlikely to have a translation in F and vice versa, most previous efforts build on the assumption that phrases are more likely to have translational equivalents on the other side (Munteanu and Marcu, 2006; Cettolo et al., 2010; Zhang and Zong, 2013; Dong et al., 2015). Such a set of phrases can be constructed by collecting either constituents of parsed sentences or strings with hyperlinks on webpages (e.g., Wikipedia). Therefore, we assume the two monolingual corpora are readily available and focus on how to extract D from E and F.
To address this problem, Dong et al. (2015) introduce a corpus-level latent-variable translation model in a non-parallel scenario:

$$P(F|E; \theta) = \sum_{m} P(F, m|E; \theta), \tag{1}$$

where $m$ is phrase alignment and $\theta$ is a set of model parameters. Each target phrase $f^{(t)}$ is restricted to connect to exactly one source phrase: $m = (m_1, \ldots, m_t, \ldots, m_T)$, where $m_t \in \{0, 1, \ldots, S\}$. For example, $m_t = s$ denotes that $f^{(t)}$ is aligned to $e^{(s)}$. Note that $e^{(0)}$ represents an empty source phrase. They follow IBM Model 1 (Brown et al., 1993) to further decompose the model into phrase translation probabilities $P(f^{(t)}|e^{(m_t)}; \theta)$, one for each target phrase. The phrase translation model can be further defined as

$$P(f^{(t)}|e^{(m_t)}; \theta) = \sum_{a} P(f^{(t)}, a|e^{(m_t)}; \theta), \tag{3}$$

where $a$ is the word alignment between $f^{(t)}$ and $e^{(m_t)}$.
Dong et al. (2015) distinguish between empty and non-empty phrase translations. If a target phrase $f^{(t)}$ is aligned to the empty source phrase $e^{(0)}$ (i.e., $m_t = 0$), they set the phrase translation probability to a fixed number $\epsilon$. Otherwise, conventional word alignment models such as IBM Model 1 can be used for non-empty phrase translation, with $p(J|I)$ serving as a length model and $p(f|e)$ as a translation model. We use $J^{(t)}$ to denote the length of $f^{(t)}$. Therefore, the latent-variable model involves two kinds of latent structures: (1) phrase alignment $m$ between source and target phrases and (2) word alignment $a$ between source and target words within phrases.
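To make the phrase translation model concrete, the following minimal Python sketch scores one phrase pair. It assumes the textbook IBM Model 1 form (the length probability divided by $(I+1)^J$, with a NULL source word), which may differ in detail from the normalization used by Dong et al. (2015). The names `phrase_translation_prob`, `length_prob`, `trans_prob`, `NULL`, and the default value of `epsilon` are hypothetical.

```python
from typing import Dict, Sequence, Tuple

NULL = "<null>"  # hypothetical token standing for the empty source word


def phrase_translation_prob(
    f_words: Sequence[str],
    e_words: Sequence[str],
    length_prob: Dict[Tuple[int, int], float],
    trans_prob: Dict[Tuple[str, str], float],
    epsilon: float = 1e-6,
) -> float:
    """Textbook IBM Model 1 score of target phrase f given source phrase e.

    An empty source phrase (m_t = 0) receives the fixed probability epsilon.
    length_prob is keyed by (J, I); trans_prob is keyed by (f_word, e_word).
    """
    if not e_words:
        return epsilon
    J, I = len(f_words), len(e_words)
    # Length model p(J|I), spread uniformly over the (I + 1)^J word alignments.
    score = length_prob.get((J, I), 0.0) / (I + 1) ** J
    source = [NULL, *e_words]
    for f in f_words:
        # Summing over source positions marginalizes out the word alignment.
        score *= sum(trans_prob.get((f, e), 0.0) for e in source)
    return score
```

Because the word alignment is summed out position by position, the score costs only $O(IJ)$ table lookups per phrase pair.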
Given the two monolingual corpora E and F, the training objective is to maximize the likelihood of the training data. Training is initialized with a small seed parallel lexicon $d$, and $\sigma(f, e, d)$ checks whether an entry $\langle f, e \rangle$ exists in $d$.
Given the monolingual corpora and the optimized model parameters $\theta^*$, the Viterbi phrase alignment is calculated as

$$m^* = \operatorname*{argmax}_{m} P(F, m|E; \theta^*).$$

Finally, parallel lexicons can be derived from the translation probability table of IBM Model 1 in $\theta^*$, and parallel phrases can be collected from the Viterbi phrase alignment $m^*$. This process iterates and enlarges parallel lexicons and phrases gradually over time.
As it is very challenging to extract parallel phrases from non-parallel corpora, unidirectional models might only capture partial aspects of translation modeling on non-parallel corpora. Indeed, Dong et al. (2015) find that the accuracy of phrase alignment is only around 50% on the Chinese-English dataset. More importantly, their approach seems to be vulnerable to noise, as the accuracy drops significantly with the increase of noise. As source-to-target and target-to-source translation models are usually complementary (Och and Ney, 2003; Koehn et al., 2003; Liang et al., 2006), it is appealing to combine them to improve alignment accuracy.

Agreement-based Learning
The basic idea of our work is to encourage the source-to-target and target-to-source translation models to agree on both phrase and word alignments.
For example, Figure 1 shows Chinese-to-English and English-to-Chinese phrase alignments on the same non-parallel data. As each model only captures partial aspects of translation modeling, our intuition is that the links on which two models agree (highlighted in red) are more likely to be correct.
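More formally, let $P(F|E; \overrightarrow{\theta})$ be a source-to-target translation model and $P(E|F; \overleftarrow{\theta})$ be a target-to-source translation model, with source-to-target phrase alignment $\overrightarrow{m}$ and target-to-source phrase alignment $\overleftarrow{m}$. To ease the comparison between $\overrightarrow{m}$ and $\overleftarrow{m}$, we represent them equivalently as sets of non-empty links $\langle s, t \rangle$ between a source phrase $e^{(s)}$ and a target phrase $f^{(t)}$. For example, the source-to-target phrase alignment $\overrightarrow{m} = (2, 3, 0, 0)$ corresponds to the link set $\{\langle 2, 1 \rangle, \langle 3, 2 \rangle\}$.
Following Liang et al. (2006), we introduce a new training objective that favors the agreement between two unidirectional models. The objective is defined in terms of the posterior probabilities of phrase alignments in the two directions,

$$P(\overrightarrow{m}|F, E; \overrightarrow{\theta}) = \frac{P(F, \overrightarrow{m}|E; \overrightarrow{\theta})}{P(F|E; \overrightarrow{\theta})}, \qquad P(\overleftarrow{m}|F, E; \overleftarrow{\theta}) = \frac{P(E, \overleftarrow{m}|F; \overleftarrow{\theta})}{P(E|F; \overleftarrow{\theta})},$$

and a loss function that measures the disagreement between the two models.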
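The set-of-links representation can be sketched in a few lines of Python. The helper names `s2t_links` and `t2s_links` are hypothetical, and the target-to-source vector `(0, 1, 2)` in the usage note is an invented illustration rather than the paper's own example.

```python
from typing import Sequence, Set, Tuple

Link = Tuple[int, int]  # (source phrase index s, target phrase index t)


def s2t_links(m: Sequence[int]) -> Set[Link]:
    """Source-to-target: m[t-1] = s means target t is aligned to source s."""
    return {(s, t) for t, s in enumerate(m, start=1) if s != 0}


def t2s_links(m: Sequence[int]) -> Set[Link]:
    """Target-to-source: m[s-1] = t means source s is aligned to target t."""
    return {(s, t) for s, t in enumerate(m, start=1) if t != 0}
```

With both directions expressed as sets of `(s, t)` pairs, agreement reduces to set intersection: `s2t_links((2, 3, 0, 0)) & t2s_links((0, 1, 2))` keeps exactly the links proposed by both models.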

Definition
A straightforward loss function is to force the two models to generate identical phrase alignments. We refer to this loss function (Eq. (14)) as outer agreement since it only considers phrase alignment and ignores the word alignment within aligned phrases.

Training Objective
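Since the outer agreement forces two models to generate identical phrase alignments, the training objective can be written as

$$J_{\mathrm{outer}}(\overrightarrow{\theta}, \overleftarrow{\theta}) = \log P(F|E; \overrightarrow{\theta}) + \log P(E|F; \overleftarrow{\theta}) + \log \sum_{m} P(m|F, E; \overrightarrow{\theta}) \, P(m|F, E; \overleftarrow{\theta}), \tag{15}$$

where $m$ is a phrase alignment on which two models agree.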
The partial derivatives of the training objective with respect to the source-to-target model parameters $\overrightarrow{\theta}$ involve expectations over phrase alignments (Eq. (16)). The partial derivatives with respect to $\overleftarrow{\theta}$ are defined likewise.

Training Algorithm
As the expectation in Eq. (16) is usually intractable to calculate due to the exponential search space of phrase alignment, we follow Dong et al. (2015) to use a Viterbi EM algorithm instead.
As shown in Figure 2, the algorithm takes a set of source phrases E, a set of target phrases F, and a seed parallel lexicon d as input (line 1). After initializing the model parameters $\Theta^{(0)}$, the algorithm calls the procedure ALIGN(F, E, Θ) to compute the Viterbi phrase alignment between E and F on which two models agree. Then, the algorithm updates the two models by normalizing counts collected from the Viterbi phrase alignment. The process iterates for K iterations and returns the final Viterbi phrase alignment and model parameters.
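The control flow described above can be sketched as a short Python loop. This is only a skeleton under stated assumptions: `align` and `update` are hypothetical callables that stand in for the ALIGN procedure and the count-based parameter update, and they are assumed to close over the corpora E, F and the seed lexicon d.

```python
from typing import Callable, Set, Tuple, TypeVar

Params = TypeVar("Params")
Links = Set[Tuple[int, int]]


def viterbi_em(
    init_params: Params,
    align: Callable[[Params], Links],
    update: Callable[[Links], Params],
    num_iterations: int,
) -> Tuple[Links, Params]:
    """Alternate between agreed Viterbi alignment and count-based updates."""
    params = init_params
    agreed: Links = set()
    for _ in range(num_iterations):
        agreed = align(params)   # Viterbi phrase alignment both models agree on
        params = update(agreed)  # re-estimate both models from that alignment
    return agreed, params
```

Keeping the two steps behind callables lets the outer and inner agreement variants share the same loop and differ only in how alignments are scored and counts are collected.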

Computing Viterbi Phrase Alignments
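The procedure ALIGN(F, E, Θ) computes the Viterbi phrase alignment $\hat{m}$ between E and F on which two models agree as follows:

$$\hat{m} = \operatorname*{argmax}_{m} \Big\{ P(m|F, E; \overrightarrow{\theta}) \, P(m|F, E; \overleftarrow{\theta}) \Big\}. \tag{17}$$

Unfortunately, due to the exponential search space of phrase alignment, computing $\hat{m}$ is also intractable. As a result, we approximate it as the intersection of two unidirectional Viterbi phrase alignments:

$$\hat{m} \approx \overrightarrow{m}^* \cap \overleftarrow{m}^*, \tag{18}$$

where the unidirectional Viterbi phrase alignments are calculated as

$$\overrightarrow{m}^* = \operatorname*{argmax}_{\overrightarrow{m}} P(F, \overrightarrow{m}|E; \overrightarrow{\theta}), \qquad \overleftarrow{m}^* = \operatorname*{argmax}_{\overleftarrow{m}} P(E, \overleftarrow{m}|F; \overleftarrow{\theta}).$$

For the source-to-target direction, Dong et al. (2015) indicate that the Viterbi alignments of individual target phrases can be computed independently, so one only needs to find the most probable source phrase for each target phrase:

$$\overrightarrow{m}^*_t = \operatorname*{argmax}_{s \in \{0, 1, \ldots, S\}} P(f^{(t)}|e^{(s)}; \overrightarrow{\theta}).$$

This can be cast as a translation retrieval problem (Zhang and Zong, 2013; Dong et al., 2014). Please refer to Dong et al. (2015) for more details. The target-to-source Viterbi phrase alignment can be calculated similarly.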
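The intersection approximation can be illustrated with a small Python sketch. It performs an exhaustive search over all phrase pairs, which is only feasible for toy data and stands in for the translation retrieval step used in practice. The function names, the `Scorer` signature, and the default `epsilon` are hypothetical.

```python
from typing import Callable, Sequence, Set, Tuple

Phrase = Tuple[str, ...]
Scorer = Callable[[Phrase, Phrase], float]  # score(generated, given)


def viterbi_links(
    targets: Sequence[Phrase],
    sources: Sequence[Phrase],
    score: Scorer,
    epsilon: float,
) -> Set[Tuple[int, int]]:
    """For each target, keep the best source if it beats the empty phrase."""
    links = set()
    for t, f in enumerate(targets, start=1):
        best_s, best_p = 0, epsilon  # index 0 stands for the empty phrase
        for s, e in enumerate(sources, start=1):
            p = score(f, e)
            if p > best_p:
                best_s, best_p = s, p
        if best_s != 0:
            links.add((best_s, t))
    return links


def agreed_alignment(E, F, score_f_given_e, score_e_given_f, epsilon=1e-6):
    """Approximate the agreed alignment by intersecting both directions."""
    s2t = viterbi_links(F, E, score_f_given_e, epsilon)
    # In the reverse direction the roles swap, so flip (t, s) back to (s, t).
    t2s = {(s, t) for (t, s) in viterbi_links(E, F, score_e_given_f, epsilon)}
    return s2t & t2s
```

Each direction may link several phrases to the same counterpart, and the intersection keeps only the links that both directions propose.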

Updating Model Parameters
Following Liang et al. (2006), we collect counts of model parameters only from the agreement term. (We experimented with collecting counts from both the unidirectional and agreement terms but obtained much worse results than counting only from the agreement term.) Given the agreed Viterbi phrase alignment $\hat{m}$, the count of the source-to-target length model $p(J|I)$ is given by

$$c(J|I; E, F) = \sum_{\langle s, t \rangle \in \hat{m}} \delta(J^{(t)}, J) \, \delta(I^{(s)}, I). \tag{24}$$

The new length probabilities can be obtained by

$$p(J|I) = \frac{c(J|I; E, F)}{\sum_{J'} c(J'|I; E, F)}. \tag{25}$$

Figure 3: Agreement between (a) Chinese-to-English and (b) English-to-Chinese word alignments. The links on which two models agree are highlighted in red. The inner agreement loss function (see Eq. (28)) aims to encourage the agreement at both the phrase and word levels.
The count of the source-to-target translation model $p(f|e)$ is likewise collected from the agreed Viterbi phrase alignment, and the new translation probabilities are obtained by normalizing the counts. Counts of target-to-source length and translation models can be calculated in a similar way.
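As a concrete illustration of collecting and normalizing the length counts of Eq. (24), consider the following Python sketch. The function name `update_length_model` and the list-based representation of phrase lengths are assumptions made for this example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def update_length_model(
    agreed: Iterable[Tuple[int, int]],
    source_lengths: Sequence[int],
    target_lengths: Sequence[int],
) -> Dict[Tuple[int, int], float]:
    """Return p(J|I) keyed by (J, I), estimated from agreed links (s, t)."""
    counts = Counter()
    totals = defaultdict(float)
    for s, t in agreed:
        # Links are 1-based, the length lists are 0-based.
        I, J = source_lengths[s - 1], target_lengths[t - 1]
        counts[(J, I)] += 1
        totals[I] += 1
    # Normalize over J for each source length I.
    return {(J, I): c / totals[I] for (J, I), c in counts.items()}
```

Only phrase pairs in the agreed alignment contribute, so phrases on which the two models disagree leave the length model untouched.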

Definition
Whereas the outer agreement only considers the phrase alignment, the inner agreement takes both phrase alignment and word alignment into consideration (Eq. (28)). For example, Figure 3 shows Chinese-to-English and English-to-Chinese word alignments. The shared links are highlighted in red. Our intuition is that a source phrase and a target phrase are more likely to be translations of each other if the two translation models also agree on word alignment within aligned phrases.

Training Objective and Algorithm
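The training objective for inner agreement follows the same form as that of outer agreement, with the inner agreement loss function in place of the outer one. We still use the Viterbi EM algorithm shown in Figure 2 for training the two models.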

Computing Viterbi Phrase Alignments
The agreed Viterbi phrase alignment $\hat{m}$ is defined analogously to the outer agreement case. As computing $\hat{m}$ is intractable, we still approximate it using the intersection of two unidirectional Viterbi phrase alignments (see Eq. (18)). The source-to-target Viterbi phrase alignment is calculated by Eq. (31), in which $P(\langle i, j \rangle | e^{(s)}, f^{(t)}; \overrightarrow{\theta})$ is the source-to-target link posterior probability of the link $\langle i, j \rangle$ being present (or absent) in the word alignment according to the source-to-target model, and $P(\langle i, j \rangle | f^{(t)}, e^{(s)}; \overleftarrow{\theta})$ is the target-to-source link posterior probability. We follow Liang et al. (2006) to use the product of link posteriors to encourage the agreement at the level of word alignment.
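The product of link posteriors can be illustrated with a small Python sketch. It assumes plain IBM Model 1 without a NULL word, under which the posterior of a link is the word translation probability normalized over the candidate positions. It does not reproduce how Eq. (31) aggregates these products into a phrase-level score. The function and table names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Table = Dict[Tuple[str, str], float]


def link_posteriors(given: Sequence[str], generated: Sequence[str],
                    table: Table) -> List[List[float]]:
    """post[i][j] = P(generated[j] is linked to given[i]) under IBM Model 1.

    table[(w_generated, w_given)] is the word translation probability.
    """
    post = [[0.0] * len(generated) for _ in given]
    for j, w in enumerate(generated):
        z = sum(table.get((w, g), 0.0) for g in given)
        for i, g in enumerate(given):
            post[i][j] = table.get((w, g), 0.0) / z if z > 0 else 0.0
    return post


def agreement_posteriors(e_words, f_words, p_f_given_e: Table,
                         p_e_given_f: Table) -> List[List[float]]:
    """Product of the two directional posteriors for every link (i, j)."""
    s2t = link_posteriors(e_words, f_words, p_f_given_e)  # indexed [i][j]
    t2s = link_posteriors(f_words, e_words, p_e_given_f)  # indexed [j][i]
    return [[s2t[i][j] * t2s[j][i] for j in range(len(f_words))]
            for i in range(len(e_words))]
```

A link receives a high product only when both directional models consider it likely, which is the word-level counterpart of intersecting phrase alignments.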
We use a coarse-to-fine approach (Dong et al., 2015) to compute the Viterbi alignment: first retrieving a coarse set of candidate source phrases using translation probabilities and then selecting the candidate with the highest score according to Eq. (31). The target-to-source Viterbi phrase alignment can be calculated similarly.

Updating Model Parameters
Given the agreed Viterbi phrase alignment $\hat{m}$, the count of the source-to-target length model $p(J|I)$ is still given by Eq. (24). Only the count of the translation model $p(f|e)$ is calculated differently. Counts of target-to-source length and translation models can be calculated in a similar way.

Experiments
In this section, we evaluate our approach on two tasks: phrase alignment (Section 4.1) and machine translation (Section 4.2).

Evaluation Metrics
Given two monolingual corpora E and F , we suppose there exists a ground truth parallel corpus G and denote an extracted parallel corpus as D. The quality of an extracted parallel corpus can be measured by F1 = 2|D ∩ G|/(|D| + |G|).
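The metric above is the harmonic mean of precision $|D \cap G|/|D|$ and recall $|D \cap G|/|G|$. A minimal Python sketch, assuming phrase pairs are stored as tuples of strings and using the hypothetical name `f1_score`, is:

```python
from typing import Set, Tuple

Pair = Tuple[str, str]


def f1_score(extracted: Set[Pair], gold: Set[Pair]) -> float:
    """F1 = 2|D ∩ G| / (|D| + |G|); defined as 0 when both sets are empty."""
    denom = len(extracted) + len(gold)
    return 2 * len(extracted & gold) / denom if denom else 0.0
```

For instance, an extracted corpus of three pairs that shares two pairs with a three-pair ground truth scores 2 · 2 / (3 + 3) = 2/3.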

Data Preparation
Although it is appealing to apply our approach to real-world non-parallel corpora, it is time-consuming and labor-intensive to manually construct a ground truth parallel corpus. Therefore, we follow Dong et al. (2015) to build synthetic E, F, and G to facilitate the evaluation.
We first extract a set of parallel phrases from a sentence-level parallel corpus using the state-of-the-art phrase-based translation system Moses (Koehn et al., 2007) and discard low-probability parallel phrases. Then, E and F can be constructed by corrupting the parallel phrase set, that is, by randomly adding irrelevant source and target phrases. Note that the parallel phrase set can serve as the ground truth parallel corpus G. We refer to the non-parallel phrases in E and F as noise.
From LDC Chinese-English parallel corpora, we constructed a development set and a test set. The development set contains 20K parallel phrases, 20K noisy Chinese phrases, and 20K noisy English phrases. The test set contains 20K parallel phrases, 180K noisy Chinese phrases, and 180K noisy English phrases. The seed parallel lexicon contains 1K entries.
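The construction of the synthetic evaluation data can be sketched as follows. This is a simplified illustration: the function name, the fixed random seed, and the assumption that noise phrases are supplied as ready-made lists are hypothetical, and the Moses-based extraction and probability filtering are not shown.

```python
import random
from typing import List, Sequence, Set, Tuple


def build_synthetic_corpora(
    parallel: Sequence[Tuple[str, str]],
    noise_src: Sequence[str],
    noise_tgt: Sequence[str],
    seed: int = 0,
) -> Tuple[List[str], List[str], Set[Tuple[str, str]]]:
    """Return shuffled E and F plus the ground-truth set G."""
    rng = random.Random(seed)
    E = [e for e, _ in parallel] + list(noise_src)
    F = [f for _, f in parallel] + list(noise_tgt)
    rng.shuffle(E)  # shuffle independently so position carries no signal
    rng.shuffle(F)
    return E, F, set(parallel)
```

Shuffling the two sides independently matters: without it, the position of a phrase would leak its translation.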

Comparison of Agreement Ratios
We introduce the agreement ratio to measure to what extent two unidirectional models agree on phrase alignment. Figure 4 shows the agreement ratios of independent training ("no agreement"), joint training with the outer agreement ("outer"), and joint training with the inner agreement ("inner"). We find that independently trained unidirectional models hardly agree on phrase alignment, suggesting that each model can only capture partial aspects of translation modeling on non-parallel corpora. In contrast, imposing the agreement term significantly increases the agreement ratios: after 10 iterations, about 40% of phrase alignment links are shared by two models.

Effect of Seed Lexicon Size
Table 1 shows the F1 scores of the Chinese-to-English model ("C → E"), the English-to-Chinese model ("E → C"), joint learning based on the outer agreement ("outer"), and joint learning based on the inner agreement ("inner") over various sizes of seed lexicons on the development set. We find that agreement-based learning obtains substantial improvements over independent learning across all sizes. More importantly, even with a seed lexicon containing only 50 entries, agreement-based learning is able to achieve F1 scores above 60%. The inner agreement performs better than the outer agreement by taking the consensus at the word level into account.

Effect of Noise
Table 2 demonstrates the effect of noise on the development set. In row 1, "0+0" denotes there is no noise, which can be seen as an upper bound. Adding noise, either on the Chinese side or on the English side, deteriorates the F1 scores for all methods. Adding noise on the English side makes predicting phrase alignment in the C → E direction more challenging due to the enlarged search space. The situation is similar in the reverse direction. It is clear that agreement-based learning is more robust to noise: while independent training suffers from a reduction of 40% in terms of F1 for the "20K + 20K" setting, agreement-based learning still achieves F1 scores over 70%.

Results
Figure 5 gives the final results on the test set. We find that agreement-based training achieves significant improvements over independent training. By considering the consensus on both phrase and word alignments, the inner agreement significantly outperforms the outer agreement. Notice that Dong et al. (2015) only add noise on one side while we add noisy phrases on both sides, which makes phrase alignment more challenging.
Table 3 shows examples of learned parallel words and phrases. The lexicon is built from the translation table by retaining high-probability word pairs. The examples show that our approach is capable of learning both new words and new phrases unseen in the seed lexicon.

Translation Evaluation
Following Zhang and Zong (2013) and Dong et al. (2015), we evaluate our approach on domain adaptation for machine translation.
The data set consists of two in-domain non-parallel corpora and an out-domain parallel corpus. The in-domain non-parallel corpora consist of 2.65M Chinese phrases and 3.67M English phrases extracted from LDC news articles. We use a small out-domain parallel corpus extracted from financial news of FTChina, which contains 10K phrase pairs. The task is to extract a parallel corpus from in-domain non-parallel corpora starting from a small out-domain parallel corpus.
We use the state-of-the-art translation system Moses (Koehn et al., 2007) and evaluate the performance on Chinese-English NIST datasets. The development set is NIST 2006 and the test set is NIST 2005. The evaluation metric is case-insensitive BLEU4 (Papineni et al., 2002). We use the SRILM toolkit (Stolcke, 2002) to train a 4-gram English language model on a monolingual corpus with 399M English words.
Table 4 shows the results. At iteration 0, only the out-domain corpus is used and the BLEU score is 5.61. All methods iteratively extract parallel phrases from non-parallel corpora and enlarge the extracted parallel corpus. We find that agreement-based learning achieves much higher BLEU scores while obtaining a smaller parallel corpus than independent learning. One possible reason is that agreement-based learning rules out most unlikely phrase pairs by encouraging consensus between two models.

Conclusion
We have presented agreement-based training for learning parallel lexicons and phrases from non-parallel corpora. By modeling the agreement on both phrase alignment and word alignment, our approach achieves significant improvements in both alignment and translation evaluations.
In the future, we plan to apply our approach to real-world non-parallel corpora to further verify its effectiveness. It would also be interesting to extend the phrase translation model to more sophisticated models such as IBM Models 2-5 (Brown et al., 1993) and HMM (Vogel and Ney, 1996).

Figure 1: Agreement between (a) Chinese-to-English and (b) English-to-Chinese phrase alignments. The arrows indicate translation directions. The links on which two models agree are highlighted in bold red. The outer agreement loss function (see Eq. (14)) aims to encourage the agreement at the phrase level.

Figure 2: A Viterbi EM algorithm for agreement-based learning of parallel lexicons and phrases from non-parallel corpora. F and E are non-parallel corpora, d is a seed parallel lexicon, $\Theta^{(k)}$ is the set of model parameters at the k-th iteration, and $\hat{m}^{(k)}$ is the Viterbi phrase alignment on which two models agree at the k-th iteration.

Figure 5: Comparison of F1 scores on the test set.

Table 1: Effect of seed lexicon size in terms of F1 on the development set.

Table 2: Effect of noise in terms of F1 on the development set.

Table 3: Example learned parallel lexicons and phrases. New words that are not included in the seed lexicon are highlighted in italic.

Table 4: Results on domain adaptation for machine translation.

... supported by the National Natural Science Foundation of China (No. 61522204), the 863 Program (2015AA011808), and Samsung R&D Institute of China. Huanbo Luan is supported by the National Natural Science Foundation of China (No. 61303075). Maosong Sun is supported by the Major Project of the National Social Science Foundation of China (13&ZD190).