Improving fast_align by Reordering

fast align is a simple, fast, and efficient approach for word alignment based on the IBM Model 2. fast align performs well for language pairs with relatively similar word orders; however, it does not perform well for language pairs with drastically different word orders. We propose a segmenting-reversing reordering process that addresses this problem by alternately applying fast align and reordering source sentences during training. Experimental results on Japanese-English translation demonstrate that the proposed approach significantly improves the performance of fast align without loss of efficiency. Experiments on other languages are also reported.


Introduction
Aligning words in a parallel corpus is a basic task for almost all state-of-the-art statistical machine translation (SMT) systems. Word alignment is used to extract translation rules in various ways, such as the phrase pairs used in a phrase-based (PB) SMT system (Koehn et al., 2003), the hierarchical rules used in a Hiero system (Chiang, 2007), and the sophisticated translation templates used in tree-based SMT systems (Liu et al., 2006).
fast align (Dyer et al., 2013) is a recently proposed word alignment approach based on a reparameterization of the IBM Model 2, which is usually referred to as a zero-order alignment model (Och and Ney, 2003). Taking advantage of the simplicity of the IBM Model 2, fast align introduces a "tension" parameter to model the overall accordance of word orders, and an efficient parameter re-estimation algorithm is devised. fast align has been reported to be more than 10 times faster than a GIZA++ baseline, with comparable results in end-to-end French-, Chinese-, and Arabic-to-English translation experiments.
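For reference, the reparameterization, as we recall it from Dyer et al. (2013) (the notation here is ours and may differ from the original paper), replaces Model 2's full distortion table with a single tension parameter $\lambda$. For a target position $j$ in a target sentence of length $m$ aligned to a source sentence of length $n$:

```latex
h(i, j, m, n) = -\left|\frac{i}{n} - \frac{j}{m}\right|,
\qquad
\delta(a_j = i \mid j, m, n) =
\begin{cases}
p_0 & \text{if } i = 0 \text{ (null)},\\[4pt]
(1 - p_0)\,\dfrac{e^{\lambda\, h(i, j, m, n)}}{Z_\lambda(j, m, n)} & \text{if } 1 \le i \le n.
\end{cases}
```

A large $\lambda$ concentrates probability mass near the diagonal, i.e., toward monotone alignments, which is precisely the preference that becomes harmful for drastically reordered language pairs.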
However, the simplicity of the IBM Model 2 also leads to a limitation. As demonstrated in this study, fast align does not perform well when applied to language pairs with drastically different word orders, e.g., Japanese and English. The problem stems from the IBM Model 2's intrinsic inability to handle complex distortions. In this study, we propose a simple and efficient reordering approach, referred to as segmenting-reversing (seg rev), to improve fast align's performance in such situations. Our motivation is to apply a rough but robust reordering that gives the source and target sentences more similar word orders, where fast align can show its power. Specifically, seg rev first segments a source-target sentence pair into a sequence of minimal monotone chunk pairs based on the automatically generated word alignment. Within the chunk pairs, source word sequences are examined to determine whether they should be completely reversed or kept in their original order. The objective of this step is to convert the source sentence to a roughly target-like word order. The seg rev process is applied recursively but not deeply (only twice in our experiments) for each source sentence in the training data. Consequently, the seg rev process is lightweight and shallow. Local word sequences, except those at chunk boundaries, are not scrambled, while global word orders are rearranged if there are large chunks. Our primary experimental results for Japanese-English translation show that applying seg rev significantly improves fast align's performance to a level comparable to GIZA++. The training time becomes 2-4 times that of baseline fast align, which is still at least 2-4 times faster than the training time required by baseline GIZA++. Results for German-, French-, and Chinese-English translations are also reported.

[Figure 1: Example of seg rev applied to a word-aligned English-Japanese sentence pair. Based on the word alignment, the source sentence ("the cutting unit 8 is activated by this signal .") is reordered into a target-like order after applying seg rev twice.]

Segmenting-Reversing Reordering
The seg rev process is inspired by the "REV preorder" (Katz-Brown and Collins, 2008), a simple pre-reordering approach originally designed for the Japanese-to-English translation task. More effective pre-reordering approaches usually require trained parsers and sophisticated machine learning frameworks (de Gispert et al., 2015; Hoshino et al., 2015). We adopt the REV method of Katz-Brown and Collins (2008) because it is, to our knowledge, the simplest and lightest pre-reordering approach, and should therefore have minimal effect on the efficiency of fast align.
An example of the seg rev process, where the word alignment is generated by fast align, is illustrated in Fig. 1. The selected example has relatively correct word alignment, and seg rev performs well on it. In general, the alignment contains significant noise and the reordering is rougher.
In Algorithm 2, the main for loop (line 3) scans the source sentence from beginning to end to obtain a monotone segmentation. The foreach (line 5) and if (line 11) constitute a general phrase-pair extraction process. The if (line 13) guarantees that each chunk is monotone on the target side. The rev function (line 16), described in Algorithm 3, determines whether the sub-sequence from s_start to s_end should be reversed by examining the related alignment A_sub. For example, in the first application shown in Fig. 1, Algorithm 3 performs the reversal. We count the concordant and discordant pairs and reverse the sub-sequence when the discordant pairs dominate.

(Footnote 7: The sub-sequences are based on the input, not the original sentence; e.g., sub-sequence [0:1] contains the 8th and 7th words of the original source sentence in the second application.)
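As a rough illustration of the segmentation step (this is a sketch, not the paper's exact Algorithm 2; the function name and the greedy merge strategy are our own), the following Python fragment splits a source sentence into minimal chunks that are consistent and monotone on the target side:

```python
def monotone_chunks(src_len, alignment):
    """Split source positions [0, src_len) into minimal chunks whose
    aligned target spans contain no links from outside the chunk and
    appear in left-to-right target order.

    alignment: list of (source_index, target_index) links.
    Returns a list of inclusive (start, end) source spans.
    """
    # Pass 1: minimal consistent chunks, scanning the source left to right.
    raw, start = [], 0
    while start < src_len:
        end = start
        while True:
            tgt = [j for i, j in alignment if start <= i <= end]
            if not tgt:
                break  # unaligned source word: forms its own chunk
            lo, hi = min(tgt), max(tgt)
            # source words outside [start, end] linked into the target span
            intruders = [i for i, j in alignment
                         if lo <= j <= hi and not (start <= i <= end)]
            if not intruders:
                break
            end = max(intruders)  # absorb crossing links and re-check
        raw.append((start, end, tgt))
        start = end + 1
    # Pass 2: merge adjacent chunks whose target spans are out of order,
    # so the surviving chunk sequence is monotone on the target side.
    chunks = []
    for start, end, tgt in raw:
        while chunks and tgt and chunks[-1][2] and min(tgt) < max(chunks[-1][2]):
            pstart, _, ptgt = chunks.pop()
            start, tgt = pstart, ptgt + tgt
        chunks.append((start, end, tgt))
    return [(s, e) for s, e, _ in chunks]
```

For a fully reversed pair (e.g., two source words aligned to target positions 1 and 0), the merge pass collapses both words into a single chunk, which the rev step can then reverse wholesale.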
(Footnote 8: Unaligned words between chunks on the source side are problematic: they are not touched by line 18. Although they could be attached to preceding or succeeding chunks, we do not use further heuristics to handle them. An example is the drifting "the" in the English sentence in Fig. 1, which our approach cannot handle properly.)

(Footnote 9: Concordant and discordant pairs as used in Kendall's τ or Goodman and Kruskal's γ.)
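The pair-counting decision of Algorithm 3 can be sketched as follows (the function name and the tie-breaking rule when concordant and discordant counts are equal are our assumptions, not stated in the paper):

```python
def should_reverse(links):
    """Decide whether to reverse a source sub-sequence by counting
    concordant vs. discordant link pairs, as in Kendall's tau.

    links: (source_index, target_index) pairs within one chunk.
    """
    con = dis = 0
    links = list(links)
    for a in range(len(links)):
        for b in range(a + 1, len(links)):  # each unordered pair once
            (i0, j0), (i1, j1) = links[a], links[b]
            s = (i1 - i0) * (j1 - j0)
            if s > 0:
                con += 1   # same relative order on both sides
            elif s < 0:
                dis += 1   # crossing pair
    return dis > con       # reverse only when discord dominates
```

The double loop over link pairs is the source of the O(l²) cost discussed in the complexity analysis below.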
Algorithm 4 describes the training framework, where fast align and seg rev are applied alternately. To generate word alignment, fast align is run bidirectionally, and symmetrization heuristics are applied to reduce noise (line 11). In each iteration, the source sentences for seg rev are the original sentences, while fast align uses the reordered sentences (except in the first iteration). The generated word alignment is thus based on the reordered source sentences; consequently, the recorded permutation (line 14) is used to recover the word alignment before the next iteration. The permutation is a one-to-one mapping, so recovery is realized by its inverse, which transfers the source-side word alignment indices back to the original source sentences.

[Algorithm 3: rev. Input: index sequence I_sub; word alignment A_sub. Lines 1-2: con ← 0; dis ← 0; foreach unordered tuple ((i0, j0), (i1, j1)) in A_sub ...]
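The recovery step (line 14 of Algorithm 4) amounts to inverting the recorded permutation and applying it to the source side of each link. A minimal sketch, assuming the convention that perm[orig] gives the position of original word orig in the reordered sentence (the paper does not fix a convention):

```python
def invert_permutation(perm):
    """Invert a one-to-one mapping: perm[orig] = reordered position."""
    inv = [0] * len(perm)
    for orig, new in enumerate(perm):
        inv[new] = orig
    return inv

def recover_alignment(links, perm):
    """Map source indices of links computed on the reordered sentence
    back to original-sentence positions via the inverse permutation."""
    inv = invert_permutation(perm)
    return sorted((inv[i], j) for i, j in links)
```

Because the permutation is a bijection over source positions, the recovery is exact and costs only O(I) per sentence.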
The time complexity of Algorithm 3 is O(l²), where l is the size of A_sub, which is related to the chunk size. If the average chunk size is a constant C depending on the language pair or data set, then the time complexity of Algorithm 2 is O(C · I²), assuming J and the size of A are both linear in I. The average chunk size decreases when seg rev is applied successively; therefore, the time required by subsequent seg rev applications decreases. In practice, seg rev processing time is negligible compared with the training time required by fast align. Note that seg rev is also easily accelerated by parallel processing.

Experiments and Discussion
We applied the proposed approach to Japanese-English translation, a language pair with dramatically different word orders. In addition, we applied it to German-English translation, a pair whose word orders differ comparatively strongly among European languages.
For Japanese-English translation, we used the NTCIR-7 PATMT data (Fujii et al., 2008). For German-English translation, we used the Europarl v7 corpus (Koehn, 2005) for training and the WMT 08 / WMT 09 test sets for development / testing, respectively. Default settings of the PB SMT system in MOSES (Koehn et al., 2007) were used, except that for Japanese-English translation the distortion limit was set to 12 to reach a recently reported baseline (Isozaki et al., 2012). MERT (Och, 2003) was used to tune parameter weights on the development sets, and BLEU (Papineni et al., 2002) on the test sets was used to evaluate translation performance. Bootstrap sampling (Koehn, 2004) with bleu kit was employed to test statistical significance.
We compared GIZA++ and fast align with default settings. GIZA++ was used as a module of MOSES. The bidirectional outputs of fast align were symmetrized by atools in cdec (Dyer et al., 2010), and further training steps were conducted using MOSES. The grow-diag-final-and symmetrization heuristic was used consistently in the experiments. For the proposed approach, we set δ = 2 and M = 4 in Algorithm 4. Note that δ could be set to a larger value and seg rev applied repeatedly until no additional reordering is possible. As mentioned, the word alignment is noisy and our intention is a robust and rough process; therefore, we restricted seg rev to two applications and did not consider differences in sentence length or language during training. Within each iteration, fast align was run with default settings, except that the initial diagonal tension (λ_ini) was set to 0.1 in the first iteration to avoid an overly strong monotone preference at the beginning of training.

Table 1: Test set BLEU scores for Japanese-English and German-English translations. (‡: statistical significance at p < 0.01; †: at p < 0.05; boldface: no significance; all compared with GIZA++)

                     ja-en    en-ja    de-en    en-de
    GIZA++           28.8     30.8     18.2     12.9
    FA λ_ini=4.0     28.1‡    29.5‡    18.0†    12.7†
    FA λ_ini=0.1     28.0‡    29.8‡    17.5‡    12.5‡
    iteration 2      28.3†    30.9     17.9‡    12.8
    iteration 3      28.4†    30.1‡    18.1     12.7†
    iteration 4      28.8     30.7     18.1     12.7†
Experimental results for Japanese-English and German-English translations in both directions are listed in Table 1. The first two rows show the baseline performance. fast align (using the default λ_ini = 4.0) performed statistically significantly worse than GIZA++, particularly for Japanese-English translation. The following four rows show the results of the proposed approach. In the first iteration, λ_ini was set to 0.1, and the performance did not change significantly. Translations from English improved (to parity with GIZA++) at the second iteration; translations to English improved more slowly. We attribute this difference in improvement rates to the relatively fixed word order of English, whereby the reordering process is easier and more consistent. Note that although translations from English improved in the second iteration, performance decreased in the following iterations. The results in Table 1 were obtained using predictable-seed tuning, which generates deterministic results. Another run using random seeds for tuning yielded test set BLEU scores of 30.5 and 30.4 on en-ja and 12.8 and 12.8 on en-de for iterations 3 and 4, respectively; none of these four scores differed from GIZA++ with statistical significance. The instability is largely due to the alignment of function words, which affects translation performance (Riesa et al., 2011). The alignment does not change much after the second iteration; however, it is unstable around function words, because seg rev does not process unaligned function words between chunks. Our approach is too rough to handle function words precisely; we plan to address this in future work.
We also tested our approach on French- and Chinese-to-English translations; the results are listed in Table 2. GIZA++ and fast align showed no statistically significant difference in performance, consistent with Dyer et al. (2013). The proposed approach did not affect performance for these language pairs. This is expected, as both pairs have similar word orders.
With regard to processing time, a naïve, single-thread C++ implementation of seg rev took approximately 60 s / 40 s for the first / second application on the entire Japanese-English corpus. The recovery process took less than 30 s in each iteration. In contrast, fast align, although very fast, took approximately one hour for one round of training (using five iterations of its log-linear model) on the same corpus. Therefore, the additional time required by our approach is quite small compared with the training time of fast align.

Conclusion and Future Work
We have proposed a simple and efficient approach to improve the performance of fast align on language pairs with drastically different word orders. With the proposed approach, fast align obtained results comparable with GIZA++, and its efficiency is retained. We are investigating further properties of seg rev and plan to extend it to achieve greater stability and efficiency.