Leave-one-out Word Alignment without Garbage Collector Effects

Expectation-maximization algorithms, such as those implemented in GIZA++, pervade the field of unsupervised word alignment. However, these algorithms suffer from over-fitting, leading to “garbage collector effects,” where rare words tend to be erroneously aligned to untranslated words. This paper proposes a leave-one-out expectation-maximization algorithm for unsupervised word alignment to address this problem. The proposed method excludes information derived from the alignment of a sentence pair from the alignment models used to align it. This prevents erroneous alignments within a sentence pair from supporting themselves. Experimental results on Chinese-English and Japanese-English corpora show that the $F_1$, precision and recall of alignment were consistently increased by 5.0% – 17.2%, and BLEU scores of end-to-end translation were raised by 0.03 – 1.30. The proposed method also outperformed $\ell_0$-normalized GIZA++ and Kneser-Ney smoothed GIZA++.


Introduction
Unsupervised word alignment (WA) on bilingual sentence pairs serves as an essential foundation for building most statistical machine translation (SMT) systems. Many methods have been proposed to raise the accuracy of WA in an effort to improve end-to-end translation quality. This paper contributes to this effort by refining the widely used expectation-maximization (EM) algorithm for WA (Dempster et al., 1977; Brown et al., 1993b; Och and Ney, 2000).
* The author is now affiliated with Google, Japan.
However, the EM algorithm for WA is well known for introducing "garbage collector effects." Rare words have a tendency to collect garbage; that is, they tend to be erroneously aligned to untranslated words (Brown et al., 1993a; Moore, 2004; Ganchev et al., 2008; V Graça et al., 2010). Figure 1(a) shows a real sentence pair, denoted $s$, from the GALE Chinese-English Word Alignment and Tagging Training corpus (GALE WA corpus)$^1$ with its human-annotated word alignment. The Chinese word "HE ZHANG," denoted $w_r$, which means river custodian, occurs only once in the whole corpus. We performed EM training using GIZA++ on this corpus concatenated with 442,967 training sentence pairs from the NIST Open Machine Translation (OpenMT) 2006 evaluation$^2$. The resulting alignment is shown in Figure 1(b). It can be seen that $w_r$ is erroneously aligned to multiple English words.
To find the cause of this, we checked the alignment of $s$ in each iteration $i$, denoted $a_s^i$. We found that in $a_s^1$, $w_r$ together with the other source-side words was aligned with uniform probability to all the target-side words, since the alignment models provided no prior information. However, in $a_s^2$, $w_r$ became erroneously aligned, because the alignment distribution of $w_r$ was learned only from $a_s^1$ and thus consisted of non-zero values only for generating the target-side words in $s$. Therefore, the alignment probabilities from the rare word $w_r$ to the unaligned words in $s$ were extraordinarily high, since almost all of the probability mass was distributed among them. In other words, the story behind these garbage collector effects is that erroneous alignments are able to provide support for themselves; the probability distribution learned only from $s$ is re-applied to $s$. In this way, these "garbage collector effects" are a form of over-fitting.
$^1$ Released by the Linguistic Data Consortium, catalog numbers LDC2012T16, LDC2012T20, LDC2012T24 and LDC2013T05.
$^2$ http://www.itl.nist.gov/iad/mig/tests/mt/2006/
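The self-supporting behaviour described above can be reproduced on a toy corpus. The sketch below is an illustration under simplifying assumptions, not the GIZA++ setup used in the paper: it runs IBM Model 1 only, on a hypothetical five-sentence corpus with a `NULL` token on the conditioning side. All tokens and the function name `model1_iteration` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus: (conditioning-side words, generated words).
# "RARE" occurs once; "y" stands for a word with no true counterpart in pair 0.
CORPUS = [
    (["NULL", "the", "river", "RARE"], ["a", "b", "x", "y"]),
    (["NULL", "the", "river"], ["a", "b", "y"]),
    (["NULL", "the", "house"], ["a", "c", "y"]),
    (["NULL", "the", "river"], ["a", "b"]),
    (["NULL", "the", "house"], ["a", "c"]),
]

def model1_iteration(corpus, t=None):
    """Run one EM iteration of IBM Model 1.

    Returns the re-estimated table t[c][g] and, per sentence pair, the
    posterior over conditioning words for every generated word.
    """
    counts = defaultdict(lambda: defaultdict(float))
    posteriors = []
    for cond, gen in corpus:
        sent_post = []
        for g in gen:
            # With no model yet (first iteration) every link is equally likely.
            scores = [1.0 if t is None else t[c].get(g, 0.0) for c in cond]
            z = sum(scores)
            post = [sc / z for sc in scores]
            sent_post.append(post)
            for c, p in zip(cond, post):
                counts[c][g] += p
        posteriors.append(sent_post)
    # M step: normalize the expected counts of each conditioning word.
    new_t = {c: {g: v / sum(row.values()) for g, v in row.items()}
             for c, row in counts.items()}
    return new_t, posteriors

t1, _ = model1_iteration(CORPUS)          # iteration 1: uniform posteriors
t2, post2 = model1_iteration(CORPUS, t1)  # iteration 2: re-applies t1

cond0, gen0 = CORPUS[0]
y_post = dict(zip(cond0, post2[0][gen0.index("y")]))
print(t1["RARE"])  # mass only on the four words of its own sentence
print(y_post)      # "RARE" receives the largest share for "y"
```

In this toy run the distribution of the singleton is learned from its own sentence pair alone, so in the second iteration it outcompetes the frequent words for "y", which is the over-fitting pattern the paragraph describes.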
Motivated by this observation, we propose a leave-one-out EM algorithm for WA in this paper. Recently this technique has been applied to avoid over-fitting in kernel density estimation (Roux and Bach, 2011); instead of performing maximum likelihood estimation, maximum leave-one-out likelihood estimation is performed. Figure 1(c) shows the effect of using our technique on the example. The garbage collection has not occurred, and the alignment of the word "HE ZHANG" is identical to the human annotation.

Related Work
The work most closely related to this paper is the training of phrase translation models with leave-one-out forced alignment (Wuebker et al., 2010; Wuebker et al., 2012). The differences are that their work operates at the phrase level and aims to improve translation models, while our work operates at the word level and aims to provide better word alignment. As word alignment is a foundation of most MT systems, our method has a wider application.
Recently, better estimation methods for the maximization step of EM have been proposed to avoid over-fitting in WA, such as using Kneser-Ney smoothing to back off the expected counts (Zhang and Chiang, 2014) or integrating a smoothed $\ell_0$ prior into the estimation of probability (Vaswani et al., 2012). Our work differs from theirs by addressing the over-fitting directly in the EM algorithm through a leave-one-out approach.
Bayesian methods (Gilks et al., 1996; Andrieu et al., 2003; DeNero et al., 2008; [...], 2011) also attempt to address the issue of over-fitting; however, EM algorithms related to the proposed method have been shown to be more efficient (Wang et al., 2014).

Methodology
This section first formulates the standard EM for WA, then presents the leave-one-out EM for WA, and finally briefly discusses the handling of singletons and efficient implementation. The main notation used in this section is shown in Table 1.

Standard EM for IBM Models 1, 2 and HMM Model
Table 1: Main notation.
$\mathbf{f}$: a foreign sentence $(f_1, \ldots, f_J)$
$\mathbf{e}$: an English sentence $(e_1, \ldots, e_I)$
$s$: a sentence pair $(\mathbf{f}, \mathbf{e})$
$\mathbf{a}$: an alignment $(a_1, \ldots, a_J)$ where $f_j$ is aligned to $e_{a_j}$
$B_i$: a list of the indexes of the foreign words which are aligned to $e_i$
$B_{i,k}$: the index of the $k$-th foreign word which is aligned to $e_i$
$\bar{B}_i$: the average of all elements in $B_i$
$\rho_i$: the largest index of an English word such that $\rho_i < i$ and $|B_{\rho_i}| > 0$
$\phi_i$: the fertility of $e_i$
$E_i$: the word class of $e_i$
$\theta_{\cdot}$: a probabilistic model
$\theta^{\bar{s}}_{\cdot}$: a leave-one-out probabilistic model for $s$
$n_x(s, \mathbf{a})$: the number of times that an event $x$ happens in $(s, \mathbf{a})$
$N_x(s)$: the marginal number of times that an event $x$ happens in $s$
Note that $N_x(s) = \sum_{\mathbf{a}} n_x(s, \mathbf{a}) P(\mathbf{a} \mid s)$. In practical calculation, for IBM models 1, 2 and the HMM model, this summation is performed by dynamic programming; for IBM model 4, it is performed approximately using the best alignment and its neighbors.

To perform WA through EM, the parallel corpus is taken as observed data and the alignments are taken as latent data. In order to maximize the likelihood of the alignment model $\theta$ given the data $S$, the following two steps are conducted iteratively (Brown et al., 1993b; Och and Ney, 2000; Och and Ney, 2003).
Expectation step (E step): calculating the conditional probability of alignments for each sentence pair, where $\theta_{ali}(i \mid i', I)$ is the alignment probability and $\theta_{lex}(f \mid e)$ is the translation probability. Note that (1) is a general form for IBM model 1, model 2 and the HMM model.
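For orientation, one general E-step form that is consistent with the two probabilities named above can be written as follows. This is a reconstruction under the usual IBM/HMM factorization and not a verbatim copy of Equation (1); in particular, the conditioning on the previous position $a_{j-1}$ and the treatment of the start position are assumptions.

```latex
P(\mathbf{a} \mid s, \theta)
  = \frac{P(\mathbf{f}, \mathbf{a} \mid \mathbf{e}, \theta)}
         {\sum_{\mathbf{a}'} P(\mathbf{f}, \mathbf{a}' \mid \mathbf{e}, \theta)},
\qquad
P(\mathbf{f}, \mathbf{a} \mid \mathbf{e}, \theta)
  = \prod_{j=1}^{J} \theta_{ali}(a_j \mid a_{j-1}, I)\, \theta_{lex}(f_j \mid e_{a_j})
```

Under this reading, IBM Model 1 corresponds to a uniform $\theta_{ali}$, and the HMM model to a $\theta_{ali}$ that depends on the jump from $a_{j-1}$ to $a_j$.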
Maximization step (M step): re-estimating the probability models, where $N_{i',I}(s)$ is the marginal number of times $e_{i'}$ is aligned to some foreign word if the length of $\mathbf{e}$ is $I$, or 0 otherwise; $N_{i|i',I}(s)$ is the marginal number of times the next alignment position after $i'$ is $i$ in $\mathbf{a}$ if the length of $\mathbf{e}$ is $I$, or 0 otherwise; $n_e(s)$ is the count of $e$ in $\mathbf{e}$; $N_{f|e}(s)$ is the marginal number of times $e$ is aligned to $f$.
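Given the four counts defined above, the re-estimates are presumably ratios of counts summed over the corpus. The block below is a hedged reconstruction from those definitions, not a verbatim copy of the paper's equations; the pairing of each numerator with its denominator is an assumption based on which counts share the same conditioning event.

```latex
\theta_{ali}(i \mid i', I) = \frac{\sum_{s} N_{i|i',I}(s)}{\sum_{s} N_{i',I}(s)},
\qquad
\theta_{lex}(f \mid e) = \frac{\sum_{s} N_{f|e}(s)}{\sum_{s} n_{e}(s)}
```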

Leave-one-out EM for IBM Models 1, 2 and HMM Model
Leave-one-out EM for WA differs from standard EM in the way the alignment and translation probabilities are calculated. Each sentence pair has its own alignment and translation probability models, calculated by excluding the sentence pair itself. More formally, leave-one-out EM for WA is formulated as follows.
Leave-one-out E step: employing leave-one-out models for each $s$ to calculate the conditional probability of alignments, where $\theta^{\bar{s}}_{ali}(i \mid i', I)$ and $\theta^{\bar{s}}_{lex}(f_j \mid e_{a_j})$ are the leave-one-out alignment probability and translation probability, respectively.
Leave-one-out M step: re-estimating the leave-one-out probability models.
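A form of these re-estimates that matches the description "calculated by excluding the sentence pair itself" is sketched below. It is a reconstruction from the count definitions of the standard M step rather than a verbatim copy of the paper's equations. The leave-one-out model for $s$ is written here as $\theta^{\bar{s}}$, and $s'$ ranges over the training sentence pairs.

```latex
\theta^{\bar{s}}_{ali}(i \mid i', I)
  = \frac{\sum_{s' \neq s} N_{i|i',I}(s')}{\sum_{s' \neq s} N_{i',I}(s')},
\qquad
\theta^{\bar{s}}_{lex}(f \mid e)
  = \frac{\sum_{s' \neq s} N_{f|e}(s')}{\sum_{s' \neq s} n_{e}(s')}
```

The only change from the standard re-estimates would then be the restriction $s' \neq s$ in every sum, so the counts that $s$ contributed in the previous E step cannot support its own alignment.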

Standard EM for IBM Model 4
The framework of the standard EM for IBM Model 4 is similar to that for IBM Models 1, 2 and the HMM Model, but the calculation of the alignment probability is more complicated.
E step: calculating the conditional probability through the reverted alignment (Och and Ney, 2003), where $B_0$ means the set of foreign words aligned with the empty word; $P(B_0 \mid B_1, \ldots, B_I)$ is assumed to be a binomial distribution for the size of $B_0$ (Brown et al., 1993b) or a modified distribution to relieve deficiency (Och and Ney, 2003).
Here $\theta_{fer}$ is a fertility model; $\theta_{hea}$ is a probability model for the head (first) aligned foreign word; $\theta_{oth}$ is a probability model for the other aligned foreign words. $\theta_{hea}$ is assumed to be conditioned on the word class $E_{\rho_i}$, following Och and Ney (2003) and the implementations of GIZA++ and CICADA.
M step: re-estimating the probability models, where $\Delta i$ is the difference between the indexes of two foreign words.

Leave-one-out EM for IBM Model 4
The leave-one-out treatment was applied to the three component probability models $\theta_{fer}$, $\theta_{hea}$ and $\theta_{oth}$ of IBM model 4.
Leave-one-out E step: calculating the conditional probability through leave-one-out probability models.
Leave-one-out M step: re-estimating the leave-one-out probability models.

Handling Singletons
Singletons are words that occur only once in the corpus. Singletons cause problems when applying leave-one-out to lexicalized models such as the translation model $\theta^{\bar{s}}_{lex}$ and the fertility model $\theta^{\bar{s}}_{fer}$. When calculating (6) and (14) for singletons, the denominators become zero, and thus the probabilities are undefined. For singletons, there is no prior information to guide their alignment, so we back off to uniform distributions. In that case, the alignments are primarily determined by the rest of the sentence.
In addition, singletons can be on the target side of the translation model $\theta^{\bar{s}}_{lex}$. In that case, the probabilities become zero. This is handled by setting a minimum probability value of $1.0 \times 10^{-12}$, which was decided by pilot experiments.
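Putting the leave-one-out steps and the two singleton fallbacks together, a minimal sketch for the translation model alone might look as follows. It is not the authors' implementation. It assumes IBM Model 1 (uniform alignment probability), normalizes by the count of the English word as in the definition of $n_e(s)$, takes the uniform back-off to be $1/|F|$, and recomputes the leave-one-out sums naively for clarity. The corpus, the names `expected_counts` and `train`, and the stand-in vocabulary size of 1000 are hypothetical.

```python
from collections import defaultdict

MIN_PROB = 1.0e-12  # floor for unseen (f, e) pairs, the value quoted above

def expected_counts(pair, prob):
    """E step for one sentence pair under IBM Model 1 (uniform alignment prior)."""
    f_sent, e_sent = pair
    counts = defaultdict(float)
    for f in f_sent:
        scores = [prob(f, e) for e in e_sent]
        z = sum(scores)
        for e, score in zip(e_sent, scores):
            counts[(f, e)] += score / z
    return counts

def train(corpus, iterations, leave_one_out, vocab_size=None):
    """Return per-sentence expected link counts after `iterations` E steps."""
    if vocab_size is None:
        vocab_size = len({f for f_sent, _ in corpus for f in f_sent})
    # First E step: no model yet, so all links are equally likely.
    local = [expected_counts(pair, lambda f, e: 1.0) for pair in corpus]
    for _ in range(iterations - 1):
        updated = []
        for k, pair in enumerate(corpus):
            num, den = defaultdict(float), defaultdict(float)
            for k2, (other, counts) in enumerate(zip(corpus, local)):
                if leave_one_out and k2 == k:
                    continue  # exclude the pair that is being aligned
                for link, c in counts.items():
                    num[link] += c
                for e in other[1]:
                    den[e] += 1.0

            def prob(f, e, num=num, den=den):
                if den.get(e, 0.0) == 0.0:
                    return 1.0 / vocab_size  # singleton e: uniform back-off
                return max(num.get((f, e), 0.0) / den[e], MIN_PROB)

            updated.append(expected_counts(pair, prob))
        local = updated
    return local

# Hypothetical corpus: (foreign words, English words). "RARE" is a singleton,
# "x" is its translation, and "y" is a frequent word with no counterpart.
CORPUS = [
    (["a", "b", "x", "y"], ["NULL", "the", "river", "RARE"]),
    (["a", "b", "y"], ["NULL", "the", "river"]),
    (["a", "c", "y"], ["NULL", "the", "house"]),
    (["a", "b"], ["NULL", "the", "river"]),
    (["a", "c"], ["NULL", "the", "house"]),
]

std = train(CORPUS, iterations=2, leave_one_out=False, vocab_size=1000)
loo = train(CORPUS, iterations=2, leave_one_out=True, vocab_size=1000)
print(std[0][("y", "RARE")], loo[0][("y", "RARE")])
print(loo[0][("x", "RARE")])
```

In this toy setting the leave-one-out run leaves almost no posterior mass on the link between "y" and the singleton, while the link to "x" survives because no other English word has evidence for "x". The vocabulary size is passed explicitly because a five-word toy vocabulary would make the uniform back-off unrealistically large.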

Implementation Details
To alleviate memory requirements and increase speed, our implementation did not build or store the local alignment models explicitly for each sentence pair. The following formula was used to efficiently calculate (5), (6) and (14)-(16) to build temporary probability models,
$$\sum_{s' \neq s} N_x(s') = \sum_{s'} N_x(s') - N_x(s),$$
where $x$ is an alignment event. Our implementation maintained global counts of all alignment events, $\sum_{s'} N_x(s')$, and (considerably smaller) local counts $N_x(s)$ from each sentence pair $s$. Take the translation model $\theta^{\bar{s}}_{lex}$ for example. For a sentence pair $s = (f_1 \ldots f_J, e_1 \ldots e_I)$, it is calculated as
$$\theta^{\bar{s}}_{lex}(f_j \mid e_i) = \frac{\sum_{s'} N_{f_j|e_i}(s') - N_{f_j|e_i}(s)}{\sum_{s'} n_{e_i}(s') - n_{e_i}(s)}.$$
The global counts to be maintained are $\sum_{s'} N_{f_j|e_i}(s')$ and $\sum_{s'} n_{e_i}(s')$, and the local counts are $N_{f_j|e_i}(s)$ and $n_{e_i}(s)$. Therefore the memory cost is
$$|E| \cdot (|F| + 1) + \sum_{s} I_s (J_s + 1),$$
where $|E|$ is the size of the English vocabulary, $|F|$ is the size of the foreign language vocabulary, $I_s$ is the length of the English sentence of $s$, and $J_s$ is the length of the foreign sentence of $s$. The calculation of the leave-one-out translation model is performed for each English word and foreign word in $s$. Therefore, the time cost is $\sum_{s} I_s (J_s + 1)$.
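The bookkeeping described above, global counts kept once and local counts subtracted per sentence pair, can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the function names `global_counts` and `loo_ratio` and the example counts are hypothetical, and the caller is assumed to handle a zero denominator with the singleton back-off.

```python
from collections import Counter

def global_counts(local_pair_counts, local_word_counts):
    """Sum the per-sentence counts once; only these totals stay in memory."""
    pair_total, word_total = Counter(), Counter()
    for pairs, words in zip(local_pair_counts, local_word_counts):
        pair_total.update(pairs)   # adds the expected counts of (f, e) links
        word_total.update(words)   # adds the occurrence counts of e
    return pair_total, word_total

def loo_ratio(f, e, k, pair_total, word_total, local_pair_counts, local_word_counts):
    """Leave-one-out estimate for (f, e) with respect to sentence pair k."""
    num = pair_total[(f, e)] - local_pair_counts[k].get((f, e), 0.0)
    den = word_total[e] - local_word_counts[k].get(e, 0)
    return num / den  # the caller must handle den == 0 (singleton e)

# Hypothetical expected counts from three sentence pairs.
local_pairs = [
    {("a", "the"): 0.5, ("b", "the"): 0.5},
    {("a", "the"): 1.0},
    {("a", "the"): 0.25, ("c", "the"): 0.75},
]
local_words = [{"the": 1}, {"the": 1}, {"the": 1}]

pair_total, word_total = global_counts(local_pairs, local_words)
by_subtraction = loo_ratio("a", "the", 0, pair_total, word_total,
                           local_pairs, local_words)
# Naive alternative: re-sum the counts of all other sentence pairs.
naive = (local_pairs[1][("a", "the")] + local_pairs[2][("a", "the")]) / 2
print(by_subtraction, naive)
```

Both routes give the same value, but the subtraction needs only one pass over the corpus to build the totals, whereas re-summing all other pairs for every sentence pair would be quadratic in the corpus size.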
In addition, because the local counts $N_{f_j|e_i}(s)$ and $n_{e_i}(s)$ are read in order, storing them in external storage such as a hard disk will not slow down the running speed much. This will reduce the memory cost to $|E| \cdot (|F| + 1)$.
This cost is independent of the number of sentence pairs. The speed of the proposed method can be boosted through parallelism, since the calculations on each sentence pair can be performed independently. We found empirically that when our implementation of the proposed method is run on a 16-core computer, it finishes the task earlier than GIZA++.

Experiments
The proposed WA method was tested on two language pairs: Chinese-English and Japanese-English (Table 2). Performance was measured both directly, by agreement with manual WA annotations, and indirectly, by the BLEU score in end-to-end machine translation tasks. GIZA++ and our own implementation of standard EM were used as baselines.

Experimental Settings
The Chinese-English experimental data consisted of the GALE WA corpus and the OpenMT corpus. They are from the same domain; both contain newswire texts and web blogs. The OpenMT evaluation 2005 was used as a development set for MERT tuning (Och, 2003), and the OpenMT evaluation 2006 was used as a test set. The Japanese-English experimental data was the Kyoto Free Translation Task (Neubig, 2011). The corpus contains a set of 1,235 sentence pairs that are manually word aligned.
The corpora were processed using a standard procedure for machine translation. The English texts were tokenized with the tokenization script released with the Europarl corpus (Koehn, 2005) and converted to lowercase; the Chinese texts were segmented into words using the Stanford Word Segmenter (Xue et al., 2002); the Japanese texts were segmented into words using the Kyoto Text Analysis Toolkit (KyTea). Sentences longer than 100 words or those with foreign/English word length ratios larger than 9 were filtered out. GIZA++ was run with the default Moses settings (Koehn et al., 2007). The IBM model 1, HMM model, IBM model 3 and IBM model 4 were run with 5, 5, 3 and 3 iterations, respectively. We implemented the proposed leave-one-out EM and standard EM in IBM model 1, HMM model and IBM model 4. In the original work (Och and Ney, 2003) this combination of models achieved comparable performance to the default Moses settings. They were run with 5, 5 and 6 iterations, respectively.
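The length and ratio filter mentioned above can be written as a small predicate. This is a sketch under assumptions: the function name `keep_pair` is hypothetical, and applying the ratio test in both directions and dropping empty sides are guesses at the intended behaviour, not details given in the text.

```python
def keep_pair(foreign_tokens, english_tokens, max_len=100, max_ratio=9.0):
    """Return True if the sentence pair passes the length and ratio filters."""
    n_f, n_e = len(foreign_tokens), len(english_tokens)
    if n_f == 0 or n_e == 0:
        return False  # assumption: pairs with an empty side are dropped
    if n_f > max_len or n_e > max_len:
        return False  # a side is longer than 100 words
    # Assumption: the ratio is checked in both directions.
    return max(n_f / n_e, n_e / n_f) <= max_ratio

pairs = [
    (["f"] * 10, ["e"] * 10),   # kept
    (["f"] * 101, ["e"] * 50),  # dropped: too long
    (["f"] * 10, ["e"]),        # dropped: ratio 10 is larger than 9
    (["f"] * 9, ["e"]),         # kept: ratio 9 is not larger than 9
]
kept = [keep_pair(f, e) for f, e in pairs]
print(kept)
```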
The standard EM was re-implemented as a baseline to provide a solid basis for comparison, because GIZA++ contains many undocumented details. Our implementation is based on the CICADA toolkit (Watanabe, 2012; Tamura et al., 2013). We named the implemented aligner AGRIPPA, to support our in-house decoders OCTAVIAN and AUGUSTUS.
In all experiments, WA was performed independently in two directions: from foreign languages to English, and from English to foreign languages. Then the grow-diag-final-and heuristic was used to combine the two alignments from both directions to yield the final alignments for evaluation (Och and Ney, 2000; Och and Ney, 2003).

Word Alignment Accuracy
Word alignment accuracy of the baseline and the proposed method is shown in Table 3 in terms of precision, recall and $F_1$ (Och and Ney, 2003). The proposed method gave rise to higher quality alignments in all our experiments. The improvement in $F_1$, precision and recall based on IBM Model 4 is in the range 8.3% to 9.1% compared with the GIZA++ baseline, and in the range 5.0% to 17.2% compared with our own baseline.
The most meaningful result comes from the comparison of the models trained using standard EM log-likelihood training and the proposed leave-one-out EM log-likelihood training. These models are identical except for the way in which the model likelihood is calculated. In all our experiments the proposed method gave rise to higher quality alignments. The standard EM implementation achieved alignment performance approximately comparable to GIZA++, whereas the proposed method exceeded the performance of both implementations.
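For reference, the three reported measures can be computed from sets of alignment links as sketched below. The sketch assumes a single set of gold links per sentence pair and does not model any distinction between sure and possible links; the function name `alignment_scores` and the example links are hypothetical.

```python
def alignment_scores(predicted, gold):
    """Precision, recall and F1 of predicted links against gold links."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

# Links are (foreign position, English position) pairs.
predicted_links = {(0, 0), (1, 1), (2, 1)}
gold_links = {(0, 0), (1, 1), (2, 2), (3, 3)}
precision, recall, f1 = alignment_scores(predicted_links, gold_links)
print(precision, recall, f1)
```

Here two of the three predicted links are correct and two of the four gold links are recovered, so precision is 2/3, recall is 1/2 and $F_1$ is 4/7.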

End-to-end Translation Quality
BLEU scores achieved by the phrase-based and hierarchical SMT systems$^{10}$, which were trained from different alignment results, are shown in Table 4. Each experiment was conducted three times to mitigate the variance in the results due to MERT. The results show that the proposed alignment method achieved the highest BLEU score in all experiments. The improvement over the baseline ranged from 0.03 to 1.03 for phrase-based systems, and from 0.43 to 1.30 for hierarchical systems.
Hierarchical systems benefit more from the proposed method than phrase-based systems. We think this is because hierarchical systems are more sensitive to word alignment quality than phrase-based systems. Phrase-based systems only take contiguous parallel phrase pairs as translation rules, while hierarchical systems also use patterns made by subtracting (inner) short parallel phrases from (outer) longer parallel phrases. Both the outer and inner phrases typically need to be noise-free in order to produce high quality rules. This puts a high demand on the alignment quality.
$^{10}$ From the Moses toolkit.
Table 4: End-to-end translation quality measured by BLEU.
[Table with columns: corpus size; standard EM (GIZA++); standard EM (ours); leave-one-out (prop.)]
Table 6: Effect of training corpus size on end-to-end translation quality measured by BLEU (Chinese-English). † The whole manually word aligned corpus.

Effect of Training Corpus Size
Training corpora of different sizes were employed to perform unsupervised WA experiments and MT experiments (see Tables 5 and 6). The training corpora were randomly sampled from the Chinese-English manual WA corpus and the parallel training corpus. The manual WA corpus was given priority in sampling so that the gold WA annotation is available for MT experiments.
The settings of the unsupervised WA experiments and the MT experiments are the same as in the previous experiments. In the WA experiments, GIZA++, our implemented standard EM and the proposed leave-one-out EM are applied to the training corpora with the same parameter settings as before. In the MT experiments, the WA results of the different methods and the gold WA (if available) are employed to extract translation rules; the remaining settings, including language models, development and test corpora, and parameters, are the same as before.
On word alignment accuracy, the proposed method achieved improvements in $F_1$ from 0.041 to 0.090 under the different training corpora (Table 5). The maximum improvement compared with GIZA++ is 0.069 when the training corpus has 4,000 sentence pairs. The maximum improvement compared with our own implementation is 0.090 when the training corpus has 64,000 sentence pairs. Figure 2 shows that the extent of the improvements changes slightly across training corpora, but they are all stable and clear. On translation quality, the proposed method achieved improvements in BLEU under the different training corpora. The improvements ranged from 0.19 to 1.72 for phrase-based MT and from 0.25 to 3.02 for hierarchical MT (see Table 6). The improvements are larger under smaller training corpora (see Figure 3).
In addition, the BLEU scores achieved by the proposed method are close to those achieved by gold WA annotations. The proposed method slightly outperforms the gold WA annotations when using the full manual WA corpus of 18,057 sentence pairs.

Comparison to $\ell_0$-Normalization and Kneser-Ney Smoothing Methods
The proposed leave-one-out word alignment method was empirically compared to $\ell_0$-normalized GIZA++ (Vaswani et al., 2012) and Kneser-Ney smoothed GIZA++ (Zhang and Chiang, 2014). $\ell_0$-normalization and Kneser-Ney smoothing are established methods for overcoming the sparsity problem. They enable the probability distributions of rare words to be estimated more effectively. In this way, these two GIZA++ variants are related to the proposed method. $\ell_0$-normalized GIZA++ and Kneser-Ney smoothed GIZA++ were run with the same settings as GIZA++, which came from the default settings of MOSES. The settings of $\ell_0$-normalized GIZA++ that are not in common with GIZA++ were left at their defaults. As for Kneser-Ney smoothed GIZA++, the smooth switches of IBM models 1-4 and HMM model [...] Table 7. The experiments were run on the Chinese-English language pair. The word alignment quality was evaluated separately for all words and for various levels of rare words. The leave-one-out method outperformed related methods in terms of precision, recall and $F_1$ when evaluated on all words.
Rare words were categorized based on the number of occurrences in the source-language text of the training data. The evaluations were carried out on the subset of alignment links that had a rare word on the source side. Table 7 presents the results for thresholds 1, 2, 5 and 10. The proposed method achieved much higher precision on rare words than the other methods, but performed poorly on recall. The Kneser-Ney smoothed GIZA++ had higher recall. The explanation might be that the leave-one-out method punishes rare words more than the Kneser-Ney smoothing method, by totally removing the derived expected counts of the current sentence pair from the alignment models. This leads to rare words being passively aligned. In other words, the leave-one-out method would not align rare words unless the confidence is high. Therefore, we plan to seek a method to integrate Kneser-Ney smoothing into the proposed leave-one-out method in future work.
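The restriction to links with a rare source word can be sketched as a filter that is applied to both the predicted and the gold links before scoring. The sketch makes two assumptions that the text does not settle: a threshold $t$ is read as "at most $t$ occurrences", and frequencies are counted over the source sentences passed in. The function name `rare_word_links` and the example data are hypothetical.

```python
from collections import Counter

def rare_word_links(links_per_sentence, source_sentences, threshold):
    """Keep links whose source word occurs at most `threshold` times."""
    freq = Counter(w for sent in source_sentences for w in sent)
    kept = set()
    for sent_id, (links, sent) in enumerate(zip(links_per_sentence, source_sentences)):
        for j, i in links:  # j: source position, i: target position
            if freq[sent[j]] <= threshold:
                kept.add((sent_id, j, i))
    return kept

source = [["the", "dam"], ["the", "river"]]
links = [[(0, 0), (1, 1)], [(0, 0), (1, 2)]]
rare1 = rare_word_links(links, source, threshold=1)
rare2 = rare_word_links(links, source, threshold=2)
print(sorted(rare1))
print(len(rare2))
```

Precision and recall on rare words then follow by comparing the filtered predicted links with the filtered gold links for the same threshold.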
The BLEU scores achieved by phrase-based SMT and hierarchical SMT for different alignment methods are presented in Table 7. The proposed method outperforms the other methods. The Kneser-Ney smoothed GIZA++ performed second best. We tried to further analyze the relation between word alignment and BLEU, but found that the analysis was obscured by the many processing stages. These stages include parallel phrase extraction (or translation rule extraction for hierarchical SMT), the log-linear model, MERT tuning and practical decoding, where a lot of pruning happens.

Conclusion
This paper proposes a leave-one-out EM algorithm for WA to overcome the over-fitting problem that occurs when using standard EM for WA. The experimental results on Chinese-English and Japanese-English corpora show that both the WA accuracy and the end-to-end translation quality are improved.
In addition, we have an interesting finding about the effect of manual WA annotations on training MT systems. In a Chinese-English parallel training corpus of 18,057 sentence pairs, the manual WA annotation outperformed the unsupervised WA results produced by standard EM algorithms. However, the unsupervised WA results produced by the proposed leave-one-out EM algorithm outperformed the manual WA annotation.
Our future work will focus on increasing the gains in end-to-end translation quality through the proposed leave-one-out aligner. It is an interesting question why GIZA++ achieved competitive BLEU scores even though its alignment accuracy measured by $F_1$ was substantially lower. The answer to this question may reveal the essence of good word alignment for MT and eventually help to improve MT. In addition, we plan to improve the proposed method by integrating Kneser-Ney smoothing.