Consistency-Aware Search for Word Alignment

As conventional word alignment search algorithms usually ignore the consistency constraint in translation rule extraction, improving alignment accuracy does not necessarily increase translation quality. We propose to use coverage, which reflects how well extracted phrases can recover the training data, to enable word alignment to model consistency and correlate better with machine translation. This can be done by introducing an objective that maximizes both alignment model score and coverage. We introduce an efficient algorithm to calculate coverage on the fly during search. Experiments show that our consistency-aware search algorithm significantly outperforms both generative and discriminative alignment approaches across various languages and translation models.


Introduction
Word alignment, which aims to identify the correspondence between words in two languages, plays an important role in statistical machine translation (Brown et al., 1993). Word alignment and translation rule extraction often constitute two consecutive steps in the training pipeline. Word-aligned bilingual corpora serve as a fundamental resource for translation rule extraction, not only for phrase-based models (Koehn et al., 2003; Och and Ney, 2004), but also for syntax-based models (Chiang, 2005; Galley et al., 2006). Dividing alignment and extraction into two separate steps significantly improves the efficiency and scalability of parameter estimation as compared with directly learning translation models from bilingual corpora (Marcu and Wong, 2002; DeNero and Klein, 2008; Cohn and Blunsom, 2009).
* Corresponding author: Yang Liu.
However, separating word alignment from translation rule extraction suffers from a major problem: maximizing the accuracy of word alignment does not necessarily improve translation quality. A number of studies show that alignment error rate (AER) only has a loose correlation with BLEU (Callison-Burch et al., 2004; Goutte et al., 2004; Ittycheriah and Roukos, 2005). Ayan and Dorr (2006) find that precision-oriented alignments result in better translation performance than recall-oriented alignments. Fraser and Marcu (2007) show that AER and balanced F-measure can only partially explain the effect of alignment quality on BLEU for several language pairs. We believe that the correlation problem arises from the discrepancy between word alignment and translation rule extraction. On one hand, aligners seek to find the alignment with the highest alignment model score, without regard to structural constraints. Consequently, sensible translation rules may not be extracted because they violate the consistency constraints required by translation rule extraction (Och and Ney, 2004). Wang et al. (2010) find that the standard alignment tools are not optimal for training syntax-based models. As a result, they have to resort to realigning. On the other hand, the consistency constraint used in most translation rule extraction algorithms tolerates wrong links within consistent phrase pairs. Chiang (2007) uses the union of two unidirectional alignments, which usually has a low precision, for extracting hierarchical phrases. Therefore, it is important to include both alignment model score and the consistency constraint in the optimization objective of word alignment.
In this work, we propose to use coverage, which measures how well extracted phrases can recover the training data, to bridge word alignment and (hierarchical) phrase-based translation. We introduce a new alignment search algorithm with an objective that maximizes both alignment model score and coverage while keeping the training algorithm unchanged. The coverage of an alignment is calculated on the fly during search using a local phrase extraction algorithm. Experiments show that our approach achieves significant improvements over state-of-the-art baselines across various languages and translation models.
Figure 1: (a) An alignment resulting in a set of bilingual phrases (highlighted by shading) that can recover the training example, and (b) an alignment resulting in a set of bilingual phrases that fails to fully recover the training example. We assume the maximum phrase length w = 3. Our approach aims to avoid adding links that both have low posterior probabilities and hurt the recovery (e.g., the link between "huiwu" and "hold").

Background
We begin by introducing the preliminaries of word alignment and phrase-based translation.
Definition 1 Given a source-language sentence $\mathbf{f} = f_1^J = f_1 \dots f_J$ and a target-language sentence $\mathbf{e} = e_1^I = e_1 \dots e_I$, an alignment $\mathbf{a}$ is a subset of the Cartesian product of the word positions of the two sentences: $\mathbf{a} \subseteq \{(j, i) : j = 1, \dots, J;\ i = 1, \dots, I\}$.
Figure 1(a) shows an alignment for a Chinese sentence "oumeng he eluosi shounao huiwu zai mosike juxing" and an English sentence "EU and Russia hold summit in Moscow". We use black circles to denote links. The link (1, 1) indicates that the first Chinese word "oumeng" and the first English word "EU" are translations of each other.
Definition 2 Given a training example $\langle \mathbf{f}, \mathbf{e}, \mathbf{a} \rangle$, a bilingual phrase $B$ is a pair of source and target phrases: $B = (f_{j_1}^{j_2}, e_{i_1}^{i_2})$.
For example, ("zai mosike", "in Moscow") in Figure 1 can be denoted as a bilingual phrase $B = (f_6^7, e_6^7)$. For convenience, we use $B.j_1$ and $B.j_2$ to denote the beginning and ending positions of the source phrase in $B$, respectively. $B.i_1$ and $B.i_2$ are defined likewise for the target side.
Definition 3 A bilingual phrase $B = (f_{j_1}^{j_2}, e_{i_1}^{i_2})$ is said to be tight if and only if all boundary words (i.e., $f_{j_1}$, $f_{j_2}$, $e_{i_1}$, and $e_{i_2}$) are aligned. Otherwise, it is a loose bilingual phrase.
Definition 4 (Och and Ney, 2004) Given a training example $\langle \mathbf{f}, \mathbf{e}, \mathbf{a} \rangle$, a bilingual phrase $B = (f_{j_1}^{j_2}, e_{i_1}^{i_2})$ is said to be consistent with the word alignment $\mathbf{a}$ if and only if:
1. No words in the source phrase are aligned with words outside the target phrase and vice versa: $\forall (j, i) \in \mathbf{a}: j_1 \le j \le j_2 \leftrightarrow i_1 \le i \le i_2$.
2. At least one word in the source phrase is aligned with at least one word in the target phrase: $\exists (j, i) \in \mathbf{a}: j_1 \le j \le j_2 \wedge i_1 \le i \le i_2$.
Alignment consistency forms the basis of translation rule extraction in modern SMT systems (Koehn and Hoang, 2007; Chiang, 2007; Galley et al., 2006; Liu et al., 2006). In Figure 1, $(f_1^3, e_1^3)$ is consistent with the alignment because the words in "oumeng he eluosi" are aligned only with words in "EU and Russia" and vice versa. In contrast, in Figure 1(b), ("shounao huiwu", "hold summit") is not consistent with the alignment because "hold" is also aligned to the word "juxing", which lies outside the source phrase.
However, alignment consistency only defines a loose relationship between alignment and translation. A phrase pair consistent with alignment tolerates wrong inside links. For example, even if "oumeng" is aligned with "Russia", $(f_1^3, e_1^3)$ is still consistent. This is one possible reason that maximizing alignment accuracy does not necessarily lead to improved translation performance.
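Definitions 3 and 4 can be checked mechanically. The following is a minimal Python sketch, not the authors' implementation. The link sets `fig1a`, `fig1b`, and `wrong` are assumptions: they are one reading of Figure 1 inferred from the examples in the text, since the figure itself is not reproduced here.

```python
from typing import Set, Tuple

Link = Tuple[int, int]  # (j, i): 1-based source and target word positions


def is_consistent(a: Set[Link], j1: int, j2: int, i1: int, i2: int) -> bool:
    """Definition 4: no link crosses the phrase boundary and at least one lies inside."""
    has_inside_link = False
    for j, i in a:
        src_in = j1 <= j <= j2
        tgt_in = i1 <= i <= i2
        if src_in != tgt_in:  # the link leaves the phrase pair on one side
            return False
        has_inside_link = has_inside_link or src_in
    return has_inside_link


def is_tight(a: Set[Link], j1: int, j2: int, i1: int, i2: int) -> bool:
    """Definition 3: all four boundary words are aligned."""
    aligned_src = {j for j, _ in a}
    aligned_tgt = {i for _, i in a}
    return {j1, j2} <= aligned_src and {i1, i2} <= aligned_tgt


# Assumed links for Figure 1(a); hypothetical, inferred from the running example.
fig1a = {(1, 1), (2, 2), (3, 3), (4, 5), (5, 5), (6, 6), (7, 7), (8, 4)}
fig1b = fig1a | {(5, 4)}               # adds the "huiwu"-"hold" link
wrong = (fig1a - {(1, 1)}) | {(1, 3)}  # "oumeng" wrongly linked to "Russia"

print(is_consistent(fig1a, 1, 3, 1, 3))  # True
print(is_consistent(fig1b, 4, 5, 4, 5))  # False: "hold" is also linked to "juxing"
print(is_consistent(wrong, 1, 3, 1, 3))  # True: the wrong inside link is tolerated
```

The last call illustrates the point made above: consistency alone cannot detect a wrong link as long as it stays inside the phrase pair.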

Modeling Consistency in Word Alignment
Our intuition is that including the consistency constraint in word alignment can reduce the discrepancy between alignment and translation. While this idea has been suggested by a number of authors (e.g., Deng and Zhou, 2009; DeNero and Klein, 2010), our goal is to optimize arbitrary alignment models with respect to end-to-end translation in the search phase without labeled data (see Related Work for a detailed comparison). A natural way is to include consistency in the optimization objective as a regularization term. However, as consistency is only defined at the phrase level (see Definition 4), we need a sentence-level measure to reflect how well an alignment conforms to the consistency constraint. A straightforward measure is the number of bilingual phrases consistent with the alignment (phrase count for short), which is easy and efficient to calculate during search (Deng and Zhou, 2009). Unfortunately, optimizing with respect to phrase count is biased towards alignments with very few links, which results in a large number of bilingual phrases extracted from a small fraction of the training data. An alternative is reachability (Liang et al., 2006a; Yu et al., 2013), which indicates whether there exists a full derivation to recover the training data. However, calculating reachability faces a major problem: a large portion of training data cannot be fully recovered due to noisy alignments and the distortion limit (Yu et al., 2013).
In this work, we propose coverage, which reflects how well extracted phrases can recover the training data, to measure the sentence-level consistency. In the following, we will introduce a number of definitions to facilitate the exposition.
The indicator function $[\![\mathit{expr}]\!]$ returns 1 if the boolean expression $\mathit{expr}$ is true and returns 0 otherwise. For example, in Figure 1(a), "oumeng" and "EU" are covered by the bilingual phrase $B = (f_1^3, e_1^3)$.

Definition 6 Given a set of bilingual phrases
The definition for a target word is similar.
For example, in Figure 1(a), all source and target words are covered by the bilingual phrase set. In Figure 1(b), the source words "shounao", "huiwu", "juxing" and the target words "hold" and "summit" are not covered.
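The notion of word coverage used in these examples can be sketched in a few lines of Python. This sketch assumes that a word is covered by a bilingual phrase when its position lies inside the phrase span on its side, and that it is covered by a phrase set when at least one phrase in the set covers it. The `Phrase` tuple layout and the phrase set in the usage line are hypothetical and chosen only to mirror the Figure 1(b) example.

```python
from typing import Iterable, List, Set, Tuple

Phrase = Tuple[int, int, int, int]  # (j1, j2, i1, i2), 1-based inclusive spans


def covered_source(phrases: Iterable[Phrase]) -> Set[int]:
    """Source positions that fall inside at least one phrase."""
    return {j for j1, j2, _, _ in phrases for j in range(j1, j2 + 1)}


def covered_target(phrases: Iterable[Phrase]) -> Set[int]:
    """Target positions that fall inside at least one phrase."""
    return {i for _, _, i1, i2 in phrases for i in range(i1, i2 + 1)}


def uncovered(phrases: Iterable[Phrase], J: int, I: int) -> Tuple[List[int], List[int]]:
    """Positions of source and target words not covered by the phrase set."""
    phrases = list(phrases)
    src = set(range(1, J + 1)) - covered_source(phrases)
    tgt = set(range(1, I + 1)) - covered_target(phrases)
    return sorted(src), sorted(tgt)


# Hypothetical phrase set: only "oumeng he eluosi"/"EU and Russia" and
# "zai mosike"/"in Moscow" are available.
print(uncovered([(1, 3, 1, 3), (6, 7, 6, 7)], J=8, I=7))  # ([4, 5, 8], [4, 5])
```

The uncovered positions correspond to "shounao", "huiwu", "juxing" on the source side and "hold", "summit" on the target side.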
Definition 7 Given a sentence pair $\langle \mathbf{f}, \mathbf{e} \rangle$ and a phrase length limit $w$, the hard coverage of an alignment $\mathbf{a}$ is defined as a boolean value:
where $\mathcal{B} = \mathrm{EXTRACT}(\mathbf{f}, \mathbf{e}, \mathbf{a}, 1, J, 1, I, w)$ is the set of consistent bilingual phrases extracted from the sentence pair using a standard phrase extraction algorithm (Och and Ney, 2004). The function $\delta$ returns true if its two arguments are the same and returns false otherwise.
Algorithm 1 A consistency-aware search algorithm for word alignment.
    for all ⟨a, B⟩ ∈ open do
9:      for all l ∈ J × I − a do
10:         a′ ← a ∪ {l}
11:         B′ ← UPDATE(f, e, a, l, B, w)
12:         if GAIN(f, e, a, a′, w, θ) > 0 then

Depending on the tightness of extracted phrases (see Definition 3), we further distinguish between $C_{h+t}(\mathbf{f}, \mathbf{e}, \mathbf{a}, w)$ and $C_{h+l}(\mathbf{f}, \mathbf{e}, \mathbf{a}, w)$, which denote hard coverage calculated with tight and loose phrases, respectively.
Hard coverage denotes whether extracted phrases can fully recover the training data. For example, the values of hard coverage for Figures 1(a) and 1(b) are 1 and 0, respectively. As most training examples can hardly be fully recovered, we introduce soft coverage to better account for partially recoverable training data.
Definition 8 Given a sentence pair $\langle \mathbf{f}, \mathbf{e} \rangle$ and a phrase length limit $w$, the soft coverage of an alignment $\mathbf{a}$ is defined as
Similarly, we also distinguish between $C_{s+t}$ and $C_{s+l}$ depending on the tightness of extracted phrases.
Definition 9 Given a word-aligned bilingual corpus $D = \{\langle \mathbf{f}^{(s)}, \mathbf{e}^{(s)}, \mathbf{a}^{(s)} \rangle\}_{s=1}^{S}$ and a phrase length limit $w$, the corpus-level soft coverage is defined as
The corpus-level hard coverage is defined likewise.
Algorithm 2 Updating the set of extracted bilingual phrases after adding a link.
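To make sentence-level hard and soft coverage concrete, here is a brute-force Python sketch. It rests on several assumptions that are not confirmed by the text: soft coverage is taken to be the fraction of source and target words covered by at least one extracted phrase, the limit `w` is applied to both sides of a phrase, and the link sets are a hypothetical reading of Figure 1. The `extract` function is a naive stand-in for a standard phrase extraction algorithm, not an efficient one.

```python
from typing import List, Set, Tuple

Link = Tuple[int, int]              # (j, i), 1-based
Phrase = Tuple[int, int, int, int]  # (j1, j2, i1, i2)


def extract(a: Set[Link], J: int, I: int, w: int, tight: bool = True) -> List[Phrase]:
    """Enumerate consistent phrase pairs whose sides span at most w words."""
    src_al = {j for j, _ in a}
    tgt_al = {i for _, i in a}
    phrases = []
    for j1 in range(1, J + 1):
        for j2 in range(j1, min(J, j1 + w - 1) + 1):
            for i1 in range(1, I + 1):
                for i2 in range(i1, min(I, i1 + w - 1) + 1):
                    flags = [(j1 <= j <= j2, i1 <= i <= i2) for j, i in a]
                    inside = any(s and t for s, t in flags)
                    crossing = any(s != t for s, t in flags)
                    if not inside or crossing:
                        continue
                    if tight and not ({j1, j2} <= src_al and {i1, i2} <= tgt_al):
                        continue
                    phrases.append((j1, j2, i1, i2))
    return phrases


def soft_coverage(a: Set[Link], J: int, I: int, w: int, tight: bool = True) -> float:
    """Assumed form: covered words divided by the total number of words."""
    phrases = extract(a, J, I, w, tight)
    src = {j for j1, j2, _, _ in phrases for j in range(j1, j2 + 1)}
    tgt = {i for _, _, i1, i2 in phrases for i in range(i1, i2 + 1)}
    return (len(src) + len(tgt)) / (J + I)


def hard_coverage(a: Set[Link], J: int, I: int, w: int, tight: bool = True) -> int:
    """1 if every source and target word is covered, 0 otherwise."""
    return int(soft_coverage(a, J, I, w, tight) == 1.0)


# Assumed links for Figure 1(a) and 1(b), with w = 3 as in the figure caption.
fig1a = {(1, 1), (2, 2), (3, 3), (4, 5), (5, 5), (6, 6), (7, 7), (8, 4)}
fig1b = fig1a | {(5, 4)}
print(hard_coverage(fig1a, 8, 7, 3), hard_coverage(fig1b, 8, 7, 3))  # 1 0
```

Under these assumptions the soft coverage of the Figure 1(b) alignment is 10/15, because three source words and two target words are left uncovered, whereas hard coverage collapses it to 0.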

Consistency-Aware Search
While Deng and Zhou (2009) focus on introducing an effectiveness function such as phrase count into alignment symmetrization, we are interested in guiding the search algorithms of arbitrary alignment models using coverage. Therefore, the objective of our search algorithm is defined as
$$\mathrm{score}(\mathbf{f}, \mathbf{e}, \mathbf{a}, w, \theta) = M(\mathbf{f}, \mathbf{e}, \mathbf{a}, \theta) + \lambda\, C(\mathbf{f}, \mathbf{e}, \mathbf{a}, w)$$
where $M(\mathbf{f}, \mathbf{e}, \mathbf{a}, \theta)$ is the alignment model score, $\theta$ is a set of model parameters, $C(\mathbf{f}, \mathbf{e}, \mathbf{a}, w)$ is coverage (either hard or soft), and $\lambda$ is a hyperparameter that controls the preference between alignment model score and coverage. Note that training algorithms are unchanged: we only introduce a new search algorithm that takes coverage into consideration, and we leave consistency-aware training algorithms for arbitrary alignment models for future work. The decision rule is therefore given by
$$\hat{\mathbf{a}} = \operatorname*{argmax}_{\mathbf{a} \in \mathcal{A}(\mathbf{f}, \mathbf{e})} \mathrm{score}(\mathbf{f}, \mathbf{e}, \mathbf{a}, w, \theta)$$
where $\mathcal{A}(\mathbf{f}, \mathbf{e})$ is the set of all possible alignments for the sentence pair.
Algorithm 1 shows the consistency-aware search algorithm for word alignment. The input of the algorithm includes a source sentence $\mathbf{f}$, a target sentence $\mathbf{e}$, a set of model parameters $\theta$, a phrase length limit $w$, pruning parameters $\beta$ and $b$, and the number $n$ of most likely alignments to be retained (line 1). Inspired by Liu et al. (2010), the algorithm starts with an empty alignment $\mathbf{a}$ together with an empty phrase set $\mathcal{B}$. We use open to store active alignments during search and $N$ to store the top-$n$ alignments after search (lines 2-4). The procedure ADD(open, $\langle \mathbf{a}, \mathcal{B} \rangle$, $\beta$, $b$) adds $\langle \mathbf{a}, \mathcal{B} \rangle$ to open and discards any alignment that has a score worse than $\beta$ multiplied by the best score in the list or worse than the score of the $b$-th best alignment (line 5). For each iteration (line 6), we use a list closed to store promising alignments that have higher scores than the current alignment (line 8). For every possible link $l$ (line 9), the algorithm produces a new alignment $\mathbf{a}'$ and updates the phrase set by calling a procedure UPDATE($\mathbf{f}, \mathbf{e}, \mathbf{a}, l, \mathcal{B}, w$) (lines 10-11).
Then, the algorithm calls a procedure GAIN($\mathbf{f}, \mathbf{e}, \mathbf{a}, \mathbf{a}', w, \theta$) to calculate the change in score after adding the link $l$: $\mathrm{score}(\mathbf{f}, \mathbf{e}, \mathbf{a}', w, \theta) - \mathrm{score}(\mathbf{f}, \mathbf{e}, \mathbf{a}, w, \theta)$. If $\mathbf{a}'$ has a higher score, it is added to closed (line 13). We also update $N$ to retain the top $n$ alignments explored during the search (line 15). This process iterates until the score no longer increases.
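The overall control flow of the search can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: it keeps only a beam of size `beam` (standing in for the parameter b), omits the threshold β and the n-best list, and recomputes coverage from scratch instead of updating it incrementally. The model score `log_odds_score`, the posterior table `post`, and the loose-phrase `soft_coverage` are hypothetical stand-ins.

```python
import math
from typing import Callable, Dict, FrozenSet, Tuple

Link = Tuple[int, int]
Alignment = FrozenSet[Link]


def soft_coverage(a: Alignment, J: int, I: int, w: int = 7) -> float:
    """Assumed soft coverage with loose phrases: fraction of covered words."""
    src, tgt = set(), set()
    for j1 in range(1, J + 1):
        for j2 in range(j1, min(J, j1 + w - 1) + 1):
            for i1 in range(1, I + 1):
                for i2 in range(i1, min(I, i1 + w - 1) + 1):
                    flags = [(j1 <= j <= j2, i1 <= i <= i2) for j, i in a]
                    if any(s and t for s, t in flags) and all(s == t for s, t in flags):
                        src.update(range(j1, j2 + 1))
                        tgt.update(range(i1, i2 + 1))
    return (len(src) + len(tgt)) / (J + I)


def consistency_aware_search(J: int, I: int,
                             model_score: Callable[[Alignment], float],
                             lam: float = 0.3, beam: int = 4) -> Alignment:
    """Grow alignments link by link, keeping candidates with positive gain."""
    def score(a: Alignment) -> float:
        return model_score(a) + lam * soft_coverage(a, J, I)

    links = [(j, i) for j in range(1, J + 1) for i in range(1, I + 1)]
    best: Alignment = frozenset()
    best_score = score(best)
    open_list = [best]
    while open_list:
        closed: Dict[Alignment, float] = {}
        for a in open_list:
            base = score(a)
            for l in links:
                if l in a:
                    continue
                a_new = a | {l}
                s = score(a_new)
                if s - base > 0:  # GAIN > 0: adding the link improves the objective
                    closed[a_new] = s
        ranked = sorted(closed.items(), key=lambda kv: kv[1], reverse=True)[:beam]
        open_list = [a for a, _ in ranked]
        if ranked and ranked[0][1] > best_score:
            best, best_score = ranked[0]
    return best


# Hypothetical link posteriors for a 2 x 2 sentence pair.
post = {(1, 1): 0.9, (2, 2): 0.8, (1, 2): 0.2, (2, 1): 0.1}


def log_odds_score(a: Alignment) -> float:
    """Hypothetical model score: sum of log-odds of the chosen links."""
    return sum(math.log(post[l] / (1.0 - post[l])) for l in a)


print(sorted(consistency_aware_search(2, 2, log_odds_score)))  # [(1, 1), (2, 2)]
```

Passing the model score as a callable reflects the stated goal of applying the same search to arbitrary alignment models; only the regularization term is fixed.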
Algorithm 2 describes how to update the set of extracted bilingual phrases after adding a link. Our idea is to only update the phrases near the added link $l$ and keep other phrases unchanged. This strategy improves efficiency by avoiding extracting phrases from the entire sentence pair. The algorithm first removes bilingual phrases that are either in the same row or in the same column as $l$ (lines 2-7). For example, in Figure 1, the following bilingual phrases are removed after adding the link between "huiwu" and "hold" because the link breaks the consistency:
("shounao huiwu", "summit")
("juxing", "hold")
Other phrases out of the reach of the added link remain unchanged.
Then, the algorithm extracts bilingual phrases near $l$ by calling the procedure EXTRACT. Note that the phrase extraction is restricted to a local region $(j_1, j_2, i_1, i_2)$ by the phrase length limit $w$. We use $l.j$ and $l.i$ to denote the source and target positions of the link, respectively.
Table 1: [...] Moses. "h" denotes "hard", "s" denotes "soft", "l" denotes "loose", and "t" denotes "tight". The BLEU scores were calculated on the development set. For quick validation, we used a small fraction of the training data to train the phrase-based model.
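The local update described for Algorithm 2 can be sketched in Python as follows. This is an illustrative reconstruction, since the algorithm body is not reproduced here. The window bounds `l.j ± (w − 1)` and `l.i ± (w − 1)`, the restriction to tight phrases, and the example links are assumptions. The final check compares the incremental result with a full re-extraction.

```python
from typing import Set, Tuple

Link = Tuple[int, int]              # (j, i), 1-based
Phrase = Tuple[int, int, int, int]  # (j1, j2, i1, i2)


def extract(a: Set[Link], j_lo: int, j_hi: int, i_lo: int, i_hi: int, w: int) -> Set[Phrase]:
    """Tight consistent phrases lying inside the region, each side at most w words."""
    src_al = {j for j, _ in a}
    tgt_al = {i for _, i in a}
    out = set()
    for j1 in range(j_lo, j_hi + 1):
        for j2 in range(j1, min(j_hi, j1 + w - 1) + 1):
            if j1 not in src_al or j2 not in src_al:
                continue
            for i1 in range(i_lo, i_hi + 1):
                for i2 in range(i1, min(i_hi, i1 + w - 1) + 1):
                    if i1 not in tgt_al or i2 not in tgt_al:
                        continue
                    flags = [(j1 <= j <= j2, i1 <= i <= i2) for j, i in a]
                    if any(s and t for s, t in flags) and all(s == t for s, t in flags):
                        out.add((j1, j2, i1, i2))
    return out


def update(phrases: Set[Phrase], a: Set[Link], l: Link, J: int, I: int, w: int) -> Set[Phrase]:
    """Phrase set after adding link l, touching only phrases near l."""
    lj, li = l
    # Drop phrases in the same row or column as the new link.
    kept = {p for p in phrases if not (p[0] <= lj <= p[1] or p[2] <= li <= p[3])}
    # Re-extract inside the window that any phrase containing l must fit in.
    local = extract(a | {l},
                    max(1, lj - w + 1), min(J, lj + w - 1),
                    max(1, li - w + 1), min(I, li + w - 1), w)
    return kept | local


# Assumed links for Figure 1(a); adding "huiwu"-"hold" gives Figure 1(b).
fig1a = {(1, 1), (2, 2), (3, 3), (4, 5), (5, 5), (6, 6), (7, 7), (8, 4)}
before = extract(fig1a, 1, 8, 1, 7, 3)
after = update(before, fig1a, (5, 4), J=8, I=7, w=3)
print(after == extract(fig1a | {(5, 4)}, 1, 8, 1, 7, 3))  # True
```

The set union makes the sketch robust to phrases that lie in the window but were never removed, at the cost of re-deriving them.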

Languages and Datasets
We evaluated our approach in terms of alignment and translation quality on five language pairs: Chinese-English (ZH-EN), Czech-English (CS-EN), German-English (DE-EN), Spanish-English (ES-EN), and French-English (FR-EN). The evaluation metrics for alignment and translation are alignment error rate (AER) (Och and Ney, 2003) and case-insensitive BLEU (Papineni et al., 2002), respectively.
For Chinese-English, the training data consists of 1.2M sentence pairs with 30.9M Chinese words and 35.5M English words. We used the SRILM toolkit (Stolcke, 2002) to train a 4-gram language model on the Xinhua portion of the English GIGAWORD corpus, which contains 398.6M words. For alignment evaluation, we used the Tsinghua Chinese-English word alignment evaluation data set (Liu and Sun, 2015). For translation evaluation, we used the NIST 2006 dataset as the development set and the NIST 2002, 2003, 2004, 2005, and 2008 datasets as the test sets.
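For reference, AER as defined by Och and Ney (2003) compares a predicted alignment A against sure links S and possible links P (with S ⊆ P): AER = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|). A small Python sketch follows; the toy link sets are hypothetical and serve only to exercise the formula.

```python
from typing import Set, Tuple

Link = Tuple[int, int]


def aer(predicted: Set[Link], sure: Set[Link], possible: Set[Link]) -> float:
    """Alignment error rate; lower is better."""
    possible = possible | sure  # sure links are possible links by convention
    if not predicted and not sure:
        return 0.0
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / (len(predicted) + len(sure))


# Hypothetical toy annotation: two sure links and one extra possible link.
A = {(1, 1), (2, 2), (3, 3)}
S = {(1, 1), (2, 2)}
P = {(3, 4)}
print(round(aer(A, S, P), 3))  # 0.2
```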
For other languages, the training data is Europarl v7. The English language model trained on the Xinhua portion of the English GIGAWORD corpus was also used for translation from European languages to English. For translation evaluation, we used the "news-test2012" dataset that contains 3,003 sentences as the development set and the "news-test2013" dataset that contains 3,000 sentences as the test set.
Table 2: Comparison of different alignment methods on the Chinese-English dataset. "GDF" denotes the grow-diag-final heuristic. "phrase count" denotes optimizing with respect to maximizing the number of extracted tight phrases. We used Moses to extract loose phrases from word-aligned training data for all methods. "# bp" denotes the number of extracted bilingual phrases, "# sp" denotes the number of source phrases, "# tp" denotes the number of target phrases, "# sw" denotes the source vocabulary size, "# tw" denotes the target vocabulary size. We report BLEU scores on the NIST 2005 test set.
Table 3: Translation evaluation on different alignment models. We apply our approach to both generative and discriminative alignment models. "generative" denotes applying the grow-diag-final heuristic to the alignments produced by IBM Model 4 in two directions. "discriminative" denotes the log-linear alignment model (Liu et al., 2010). Adding coverage leads to significant improvements. We use "**" to denote that the difference is statistically significant at p < 0.01 level.

Alignment Models
We apply our approach to both generative and discriminative alignment models. For generative models, we used GIZA++ (Och and Ney, 2003) to train IBM Model 4 in two directions. To calculate a model score for symmetrized alignments, we follow Liang et al. (2006b) and leverage link posterior marginal probabilities. For discriminative models, we used the open-source toolkit TsinghuaAligner (Liu and Sun, 2015), which implements the log-linear alignment model described by Liu et al. (2010). The model score for the log-linear model is also defined using link posteriors.

Translation Models
Two kinds of translation models, phrase-based (Koehn et al., 2003) and hierarchical phrase-based (Chiang, 2007), are used to evaluate whether our approach improves the correlation between alignment and translation. For the phrase-based model, we used the open-source toolkit Moses (Koehn and Hoang, 2007). For the hierarchical phrase-based model, we used an in-house reimplementation on par with state-of-the-art open-source decoders.

Comparison of Different Settings
We first investigate the optimal setting for coverage (hard vs. soft, tight vs. loose) on the Chinese-English dataset. For quick validation, we used a subset of the training data to train the phrase-based model using Moses. We used the development set to optimize the scaling factor $\lambda$ (see Eq. (4)) and set it to 0.3 in our experiments. Table 1 compares $C_{h+l}$, $C_{h+t}$, $C_{s+l}$, and $C_{s+t}$. We find that the "soft + tight" combination (i.e., $C_{s+t}$) yields the highest BLEU score on the development set. One possible reason is that tight phrases are usually of high quality and soft coverage allows for taking full advantage of the training data. In contrast, $C_{h+t}$ yields the lowest BLEU score because hard coverage assigns zero to all partially recoverable training examples and therefore cannot distinguish between them.
Then, we investigate the effect of the phrase length limit w in Algorithm 1 on translation quality. We find w = 7 achieves the best result, which is consistent with the default setting in Moses. As a result, we used C s+t and set w = 7 in the following experiments.

Comparison of Different Alignment Methods
We compare our approach with a number of alignment methods in terms of AER and BLEU, including IBM Model 4 in two directions (C → E and E → C), symmetrization heuristics (Intersection, Union, grow-diag-final), and consistency-aware models (tight phrase count and coverage). We used Moses to extract loose bilingual phrases from the word-aligned bilingual corpora for all methods. Note that our approach uses $C_{s+t}$ for finding alignments, from which Moses extracts loose phrases. Table 2 lists the numbers of extracted bilingual phrases ("# bp"), source phrases ("# sp"), target phrases ("# tp"), source vocabulary size ("# sw"), and target vocabulary size ("# tw"). We find that a very large number of loose phrases can be extracted from the Intersection alignments, which also have the highest vocabulary sizes. However, a large portion of words in these phrases are actually unaligned, resulting in low translation quality.
We observe that adding consistency, either in terms of phrase count or coverage, improves alignment accuracy by a large margin, suggesting that imposing structural constraints helps to reduce alignment errors. Our approach significantly outperforms all other methods in terms of BLEU. Note that coverage itself does not correlate well with BLEU: it is important to achieve a balance between model score and coverage. As mentioned in Section 5.2, we set $\lambda = 0.3$ in our experiments.

Translation Evaluation on Different Alignment Models
We apply our approach to both generative (Brown et al., 1993) and discriminative (Liu et al., 2010) alignment models. As shown in Table 3, we find that adding coverage to the optimization objective significantly improves the BLEU scores. All differences are statistically significant at p < 0.01 level. This finding suggests that our approach generalizes well to various alignment models.

Translation Evaluation on Different Translation Models
We also evaluated our approach on both phrase-based and hierarchical phrase-based models. As shown in Table 4, adding coverage to generative models leads to significant improvements for both models. All the differences are statistically significant at p < 0.01 level. Although coverage is designed for extracting phrases, using coverage is still beneficial to hierarchical phrase-based models because hierarchical phrases are derived from phrases consistent with word alignment.

Translation Evaluation on Different Language Pairs
Finally, we report BLEU scores across five language pairs in Table 5. We find that the improvement of our approach over the baseline is statistically significant at p < 0.01 for four language pairs and at p < 0.05 for one language pair. This suggests that using coverage to bridge word alignment and machine translation may benefit other languages as well.

Related Work
Our work is inspired by three lines of research: (1) reachability in discriminative training of translation models, (2) structural constraints for alignment, and (3) learning with constraints.

Reachability in Discriminative Training of Translation Models
Discriminative training algorithms for statistical machine translation often need reachable training examples to find full derivations for updating model parameters (Liang et al., 2006a; Yu et al., 2013). Yu et al. (2013) report that a large portion of training data cannot be fully recovered due to noisy alignments and the distortion limit. They find that most reachable sentences are short and generally literal.
Table 4: Translation evaluation on different translation models. For translation, we used both phrase-based and hierarchical phrase-based models. For alignment, we used the generative model. "generative" denotes applying the grow-diag-final heuristic to the alignments produced by IBM Model 4 in two directions. Adding coverage leads to significant improvements. We use "**" to denote that the difference is statistically significant at p < 0.01 level.
Table 5: Translation evaluation on five language pairs. "generative" denotes applying the grow-diag-final heuristic to the alignments produced by IBM Model 4 in two directions. We use "*" and "**" to denote that the difference is statistically significant at p < 0.05 and p < 0.01, respectively. Note that ZH-EN uses four references and other language pairs only use single references.
We borrow the idea of measuring the degree of recovering training data from reachability but ignore the dependency between bilingual phrases for efficiency. To calculate reachability, one needs to figure out a full derivation, in which the bilingual phrases cover the training data and do not intersect with each other. Yu et al. (2013) indicate that using forced decoding to select reachable sentences with an unlimited distortion limit runs in $O(2^n n^3)$ time. In contrast, calculating coverage is much easier and more efficient because it ignores the dependency between phrases, yet it still retains the spirit of measuring recovery.

Structural Constraints for Alignment
Modeling structural constraints in alignment has received intensive attention in the community, either directly modeling phrase-to-phrase alignment (Marcu and Wong, 2002;DeNero and Klein, 2008;Cohn and Blunsom, 2009) or intersecting synchronous grammars with alignment (Wu, 1997;Zhang and Gildea, 2005;Haghighi et al., 2009).
Our work is closest in spirit to Deng and Zhou (2009) and DeNero and Klein (2010). Deng and Zhou (2009) cast combining IBM Model 4 alignments in two directions as an optimization problem driven by an effectiveness function. They evaluate the impact of adding or removing a link with respect to phrase extraction using the effectiveness function of phrase count. The major difference is that we generalize their idea to arbitrary alignment models in the search phase rather than bidirectional alignment combination in the post-processing phase. In addition, we find that using coverage instead of phrase count results in better translation performance (see Table 2).
DeNero and Klein (2010) develop a discriminative model of extraction sets and optimize an extraction-based loss function with respect to translation. Their model is capable of predicting the extracted phrase set. While their approach relies on annotated data for training the discriminative model, our method only needs to tune the scaling factor λ on the development set. In addition, our approach is very general and can easily apply to arbitrary alignment models by appending a term to the optimization objective.

Learning with Constraints
Our work is also related to learning with constraints such as constraint-driven learning (Chang et al., 2007) and posterior regularization (Ganchev et al., 2010).
The basic idea is to inject prior knowledge to the model as a regularization term. The major difference is that our coverage regularizer is independent of model parameters. As a result, alignment models can still be trained independently.

Conclusion
In this work, we have presented a general framework for optimizing word alignment with respect to machine translation. We introduce coverage to measure how well extracted bilingual phrases can recover the training data. We develop a consistency-aware search algorithm that efficiently calculates coverage on the fly during search. Experiments show that our approach is effective in both alignment and translation tasks across various alignment models, translation models, and language pairs.
In the future, we plan to apply our approach to syntax-based models (Galley et al., 2006;Liu et al., 2006;Shen et al., 2008) and include the constituency constraint in the optimization objective. It is also interesting to develop consistency-aware training algorithms for word alignment.