Generalized Agreement for Bidirectional Word Alignment

While agreement-based joint training has proven to deliver state-of-the-art alignment accuracy, the produced word alignments are usually restricted to one-to-one mappings because of the hard constraint on agreement. We propose a general framework that allows for arbitrary loss functions to measure the disagreement between asymmetric alignments. The loss functions can not only be defined between asymmetric alignments but also between alignments and other latent structures such as phrase segmentations. Since exact inference is intractable, we use a Viterbi EM algorithm to train the joint model. Experiments on Chinese-English translation show that joint training with generalized agreement achieves significant improvements over two state-of-the-art alignment methods.


Introduction
Word alignment is a natural language processing task that aims to specify the correspondence between words in two languages (Brown et al., 1993). It plays an important role in statistical machine translation (SMT) as word-aligned bilingual corpora serve as the input of translation rule extraction (Koehn et al., 2003; Chiang, 2007; Galley et al., 2006; Liu et al., 2006).
Although state-of-the-art generative alignment models (Brown et al., 1993; Vogel et al., 1996) have been widely used in practical SMT systems, they fail to model the symmetry of word alignment. While word alignments in real-world bilingual data usually exhibit complicated mappings (i.e., mixed with one-to-one, one-to-many, many-to-one, and many-to-many links), these models assume that each target word is aligned to exactly one source word. To alleviate this problem, heuristic methods (e.g., grow-diag-final) have been proposed to combine two asymmetric alignments (source-to-target and target-to-source) to generate symmetric bidirectional alignments (Och and Ney, 2003; Koehn and Hoang, 2007).

* Corresponding author: Yang Liu.
Instead of using heuristic symmetrization, Liang et al. (2006) introduce a principled approach that encourages the agreement between asymmetric alignments in two directions. The basic idea is to favor links on which both unidirectional models agree. They associate two models via the agreement constraint and show that agreement-based joint training improves alignment accuracy significantly.
However, enforcing agreement in joint training faces a major problem: the two models are restricted to one-to-one alignments (Liang et al., 2006). This significantly limits the translation accuracy, especially for distantly-related language pairs such as Chinese-English (see Section 5). Although posterior decoding can potentially address this problem, Liang et al. (2006) find that many-to-many alignments occur infrequently because posteriors are sharply peaked around the Viterbi alignments. We believe that this happens because their model imposes a hard constraint on agreement: the two models must share the same alignment when estimating the parameters by calculating the products of alignment posteriors (see Section 2).
In this work, we propose a general framework for imposing agreement constraints in the joint training of unidirectional models. The central idea is to use the expectation of a loss function, which measures the disagreement between two models, to replace the original probability of agreement. This allows for many possible ways to quantify agreement. Experiments on Chinese-English translation show that our approach outperforms two state-of-the-art baselines significantly.

Figure 1: Comparison of (a) independent training without agreement, (b) joint training with agreement, and (c) joint training with generalized agreement. Bold squares are gold-standard links and solid squares are model predictions. The Chinese and English sentences are segmented into phrases in (c). Joint training with agreement achieves a high precision but generally only produces one-to-one alignments. We propose generalized agreement to account for not only the consensus between asymmetric alignments, but also the conformity of alignments to other latent structures such as phrase segmentations.

Asymmetric Alignment Models
Given a source-language sentence e ≡ e_1^I = e_1, ..., e_I and a target-language sentence f ≡ f_1^J = f_1, ..., f_J, a source-to-target translation model (Brown et al., 1993; Vogel et al., 1996) can be defined as

P(f | e; θ1) = Σ_{a1} P(f, a1 | e; θ1)   (1)

where a1 denotes the source-to-target alignment and θ1 is the set of source-to-target translation model parameters.
Likewise, the target-to-source translation model is given by

P(e | f; θ2) = Σ_{a2} P(e, a2 | f; θ2)   (2)

where a2 denotes the target-to-source alignment and θ2 is the set of target-to-source translation model parameters.
Given a training set D = {⟨f^(s), e^(s)⟩}_{s=1}^{S}, the two models are trained independently to maximize the log-likelihood of the training data for each direction, respectively:

θ̂1 = argmax_{θ1} Σ_{s=1}^{S} log P(f^(s) | e^(s); θ1)   (3)

θ̂2 = argmax_{θ2} Σ_{s=1}^{S} log P(e^(s) | f^(s); θ2)   (4)

One key limitation of these generative models is that they are asymmetric: each target word is restricted to be aligned to exactly one source word (including the empty cept) in the source-to-target direction and vice versa. This is undesirable because most real-world word alignments are symmetric, in which one-to-one, one-to-many, many-to-one, and many-to-many links are usually mixed. See Figure 1(a) for example. Therefore, a number of heuristic symmetrization methods such as intersection, union, and grow-diag-final have been proposed to combine asymmetric alignments (Och and Ney, 2003; Koehn and Hoang, 2007).
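As a concrete illustration of heuristic symmetrization, here is a minimal Python sketch of the intersection and union heuristics over alignments represented as sets of links (the link sets are invented for illustration; grow-diag-final additionally grows the intersection toward the union, which we omit):

```python
def intersect(s2t, t2s):
    """Keep only links that both asymmetric alignments propose (high precision)."""
    return s2t & t2s

def union(s2t, t2s):
    """Keep links proposed by either direction (high recall)."""
    return s2t | t2s

# Alignments as sets of (source position, target position) links.
s2t = {(0, 0), (1, 1), (2, 2)}  # source-to-target Viterbi alignment
t2s = {(0, 0), (1, 1), (1, 2)}  # target-to-source Viterbi alignment
```

Intersection yields one-to-one, high-precision alignments; union admits the one-to-many and many-to-one links that the two directions disagree on.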

Alignment by Agreement
Rather than using heuristic symmetrization methods, Liang et al. (2006) propose a principled approach that jointly trains the two models by enforcing agreement:

J(θ1, θ2) = Σ_{s=1}^{S} [ log P(f^(s) | e^(s); θ1) + log P(e^(s) | f^(s); θ2) + log Σ_a P(a | f^(s), e^(s); θ1) P(a | e^(s), f^(s); θ2) ]   (5)

Note that the last term in Eq. (5) encourages the two models to agree on asymmetric alignments. While this strategy significantly improves alignment accuracy, the joint model is prone to generate one-to-one alignments because it imposes a hard constraint on agreement: the two models must share the same alignment when estimating the parameters by calculating the products of alignment posteriors. In Figure 1(b), the two one-to-one alignments are almost identical except for one link. This makes the posteriors sharply peaked around the Viterbi alignments (Liang et al., 2006). As a result, the lack of many-to-many alignments limits the benefits of joint training to end-to-end machine translation.

Generalized Agreement for Bidirectional Alignment
Our intuition is that the agreement between two alignments can be defined as a loss function, which enables us to consider various ways of quantification (Section 3.1) and even to incorporate the dependency between alignments and other latent structures such as phrase segmentations (Section 3.2).

Agreement between Word Alignments
The key idea of generalizing agreement is to leverage loss functions that measure the difference between two unidirectional alignments. For example, the last term in Eq. (5) can be re-written as

J(θ1, θ2) = Σ_{s=1}^{S} [ log P(f^(s) | e^(s); θ1) + log P(e^(s) | f^(s); θ2) + log Σ_{a1} Σ_{a2} P(a1 | f^(s), e^(s); θ1) P(a2 | e^(s), f^(s); θ2) δ(a1, a2) ]   (6)

Note that the last term in Eq. (6) is actually the expected value of agreement:

E_{a1, a2}[δ(a1, a2)] = Σ_{a1} Σ_{a2} P(a1 | f, e; θ1) P(a2 | e, f; θ2) δ(a1, a2)   (7)

Our idea is to replace δ(a1, a2) in Eq. (6) with an arbitrary loss function ∆(a1, a2) that measures the difference between a1 and a2. This gives the new joint training objective with generalized agreement:

J(θ1, θ2) = Σ_{s=1}^{S} [ log P(f^(s) | e^(s); θ1) + log P(e^(s) | f^(s); θ2) − E_{a1, a2}[∆(a1, a2)] ]   (8)

Obviously, Liang et al. (2006)'s training objective is a special case of our framework. We refer to its loss function as hard matching:

∆_HM(a1, a2) = 1 − δ(a1, a2), where δ(a1, a2) = 1 if a1 = a2 and 0 otherwise   (9)

We are interested in developing a soft version of the hard matching loss function because this will help to produce many-to-many symmetric alignments. For example, in Figure 1(c), the two alignments share most links but still allow for disagreed links to capture one-to-many and many-to-one links. Note that the union of the two asymmetric alignments is almost the same as the gold-standard alignment in this example.
While there are many possible ways to define a soft matching loss function, we choose the difference between disagreed and agreed link counts because it is easy and efficient to calculate during search:

∆_SM(a1, a2) = |(a1 ∪ a2) \ (a1 ∩ a2)| − |a1 ∩ a2|   (10)

where the alignments a1 and a2 are treated as sets of links.

Figure 2: Example Chinese-to-English (C → E) and English-to-Chinese (E → C) alignments together with phrase segmentations, where each word carries a B/I/E/S segmentation label. We expect that word alignment does not violate the phrase segmentation. The word "unofficial" in the C → E alignment is labeled with "-" because "unofficial" and "2002" belong to the same English phrase but their counterparts are separated in two Chinese phrases. Words that do not violate the phrase alignment are labeled with "+". See Section 3.2 for details.
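Treating each alignment as a set of links, the hard matching indicator and the disagreed-minus-agreed soft loss can be sketched in Python (a toy illustration under this set-of-links representation, not the authors' implementation):

```python
def hard_matching_loss(a1, a2):
    """0 when the two alignments are identical, 1 otherwise."""
    return 0 if a1 == a2 else 1

def soft_matching_loss(a1, a2):
    """Disagreed link count minus agreed link count; lower means more agreement."""
    agreed = a1 & a2                 # links both directions propose
    disagreed = (a1 | a2) - agreed   # links proposed by only one direction
    return len(disagreed) - len(agreed)
```

Unlike the hard loss, the soft loss rewards partial overlap, so two alignments that share most links but keep a few one-to-many links still score well.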

Agreement between Word Alignments and Phrase Segmentations
Our framework is very general and can be extended to include the agreement between word alignment and other latent structures such as phrase segmentations. The words in a Chinese sentence often constitute phrases that are translated as units in English and vice versa. Inspired by the alignment consistency constraint widely used in translation rule extraction (Koehn et al., 2003), we make the following assumption to impose a structural agreement constraint between word alignment and phrase segmentation: source words in one source phrase should be aligned to target words belonging to the same target phrase and vice versa.
For example, consider the C → E alignment in Figure 2. We segment Chinese and English sentences into phrases, which are sequences of consecutive words. Since "2002" and "APEC" belong to the same English phrase, their counterparts on the Chinese side should also belong to one phrase.
While this assumption can potentially improve the correlation between word alignment and phrase-based translation, a question naturally arises: how to segment sentences into phrases? Instead of leveraging chunking, we treat phrase segmentation as a latent variable and train the joint alignment and segmentation model from unlabeled data in an unsupervised way.
Formally, given a target-language sentence f ≡ f_1^J = f_1, ..., f_J, we introduce a latent variable b ≡ b_1^J = b_1, ..., b_J to denote a phrase segmentation. Each label b_j ∈ {B, I, E, S}, where B denotes the beginning word of a phrase, I an internal word, E the ending word, and S a one-word phrase. Figure 2 shows the label sequences for the sentence pair.
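Deriving the B/I/E/S label sequence from a segmentation is mechanical; a small sketch (a hypothetical helper of our own, assuming phrases are given as a list of lengths):

```python
def bies_labels(phrase_lengths):
    """Expand a list of phrase lengths into one B/I/E/S label per word."""
    labels = []
    for n in phrase_lengths:
        if n == 1:
            labels.append("S")                 # one-word phrase
        else:
            labels.append("B")                 # beginning word
            labels.extend("I" * (n - 2))       # internal words, if any
            labels.append("E")                 # ending word
    return labels
```

For example, a sentence segmented into a one-word phrase followed by a three-word phrase gets the labels S, B, I, E.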
We use a first-order HMM to model the phrase segmentation of a target sentence:

P(f, b1; θ3) = Π_{j=1}^{J} P(b_j | b_{j−1}; θ3) P(f_j | b_j; θ3)   (11)

Similarly, the hidden Markov model for the phrase segmentation of the source sentence can be defined as

P(e, b2; θ4) = Π_{i=1}^{I} P(b_i | b_{i−1}; θ4) P(e_i | b_i; θ4)   (12)

Then, we can combine word alignment and phrase segmentation and define the joint training objective as

J(θ1, θ2, θ3, θ4) = Σ_{s=1}^{S} [ log P(f^(s) | e^(s); θ1) + log P(e^(s) | f^(s); θ2) + log P(f^(s); θ3) + log P(e^(s); θ4) − E[∆(a1, a2, b1, b2)] ]   (13)

Algorithm 1: A Viterbi EM algorithm for learning the joint word alignment and phrase segmentation model from a bilingual corpus. D is a bilingual corpus, Θ^(k) is the set of model parameters at the k-th iteration, and H^(k) is the set of Viterbi latent variables at the k-th iteration.
where the expected loss is given by

E[∆(a1, a2, b1, b2)] = Σ_{a1} Σ_{a2} Σ_{b1} Σ_{b2} P(a1 | f, e; θ1) P(a2 | e, f; θ2) P(b1 | f; θ3) P(b2 | e; θ4) ∆(a1, a2, b1, b2)   (14)

We define a new loss function, segmentation violation, to measure the degree to which an alignment violates phrase segmentations.
∆_SV(a1, a2, b1, b2) = Σ_{j=1}^{J−1} β(a1, j, b1, b2) + Σ_{i=1}^{I−1} β(a2, i, b2, b1)   (15)

where β(a1, j, b1, b2) evaluates whether the two links l1 = (j, a_j) and l2 = (j+1, a_{j+1}) violate the phrase segmentation:

1. f_j and f_{j+1} belong to one phrase but e_{a_j} and e_{a_{j+1}} belong to two phrases, or

2. f_j and f_{j+1} belong to two phrases but e_{a_j} and e_{a_{j+1}} belong to one phrase.
The β function returns 1 if there is a violation and 0 otherwise.

Algorithm 2: A search algorithm for finding the Viterbi latent variables. â1 and â2 denote Viterbi alignments; b̂1 and b̂2 denote Viterbi segmentations. They form a starting point h0 for the hill climbing algorithm, which keeps changing alignments and segmentations until the model score does not increase. ĥ is the final set of Viterbi latent variables for one sentence.
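The violation check can be sketched as follows, with segmentations represented as a phrase index per word; the helper names are our own, and we assume a1 maps each target position j to a source position a1[j]:

```python
def phrase_ids(labels):
    """Turn a B/I/E/S label sequence into a phrase index per word."""
    ids, cur = [], -1
    for lab in labels:
        if lab in ("B", "S"):  # a new phrase starts here
            cur += 1
        ids.append(cur)
    return ids

def beta(a1, j, f_ids, e_ids):
    """1 if links (j, a1[j]) and (j+1, a1[j+1]) straddle phrase boundaries inconsistently."""
    same_f = f_ids[j] == f_ids[j + 1]          # f_j, f_{j+1} in one phrase?
    same_e = e_ids[a1[j]] == e_ids[a1[j + 1]]  # their counterparts in one phrase?
    return int(same_f != same_e)
```

A violation occurs exactly when the two adjacent target words agree on phrase membership but their aligned source words do not, or vice versa.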
In Figure 2, we use "+" to label words that do not violate the phrase segmentations and "-" to label violations.
In practice, we combine the two loss functions to enable word alignment and phrase segmentation to benefit each other in a joint search space:

∆_SM+SV(a1, a2, b1, b2) = ∆_SM(a1, a2) + ∆_SV(a1, a2, b1, b2)   (16)


Training

Liang et al. (2006) indicate that it is intractable to train the joint model. For simplicity and efficiency, they exploit a simple heuristic procedure that leverages the product of posterior marginal probabilities. The intuition behind the heuristic is that links on which the two models disagree should be discounted because the products of the marginals are small (Liang et al., 2006).
Unfortunately, it is hard to develop a similar heuristic for our model, which allows for arbitrary loss functions. Alternatively, we resort to a Viterbi EM algorithm, as shown in Algorithm 1. The algorithm takes the training data D = {⟨f^(s), e^(s)⟩}_{s=1}^{S} as input (line 1). We use Θ^(k) to denote the set of model parameters at the k-th iteration. After initializing the model parameters (line 2), the algorithm alternates between searching for the Viterbi alignments and segmentations Ĥ^(k) using the SEARCH procedure (line 4) and updating the model parameters using the UPDATE procedure (line 5). The algorithm terminates after running for K iterations.
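The overall loop of Algorithm 1 can be sketched generically in Python; SEARCH and UPDATE are passed in as functions, so this mirrors only the alternating structure, not the actual alignment model:

```python
def viterbi_em(data, init_params, search, update, iterations=5):
    """Alternate a hard (Viterbi) E-step and an M-step for a fixed number of iterations."""
    params = init_params
    for _ in range(iterations):
        # E-step: find the Viterbi latent variables under the current parameters.
        hidden = [search(example, params) for example in data]
        # M-step: re-estimate parameters from the Viterbi latent variables.
        params = update(data, hidden)
    return params
```

Because the E-step commits to a single (Viterbi) configuration per sentence, each iteration only needs a search procedure and a count-based re-estimation, avoiding intractable expectations.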
It is challenging to search for the Viterbi alignments and segmentations because of complicated structural dependencies. As shown in Algorithm 2, our strategy is first to find Viterbi alignments and segmentations independently using the ALIGN and SEGMENT procedures (lines 4-7), which then serve as a starting point for the HILLCLIMB procedure (lines 8-9). Figure 3 shows three operators we use in the HILLCLIMB procedure. The MOVE operator moves a link in an alignment, the MERGE operator merges two phrases into one phrase, and the SPLIT operator splits one phrase into two smaller phrases. Note that each operator can be further divided into two variants: one for the source side and another for the target side.
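A generic steepest-ascent hill climber over such operators might look like this (a toy sketch with only a MOVE-style operator and an invented score function; the paper's search also applies MERGE and SPLIT to segmentations):

```python
def hill_climb(state, neighbors, score):
    """Repeatedly jump to a better-scoring neighbor until no neighbor improves."""
    best, best_score = state, score(state)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(best):
            s = score(cand)
            if s > best_score:
                best, best_score, improved = cand, s, True
    return best

def move_neighbors(a, src_len):
    """MOVE operator: change one link of alignment a (a[j] = source position of word j)."""
    for j in range(len(a)):
        for i in range(src_len):
            if i != a[j]:
                yield a[:j] + (i,) + a[j + 1:]
```

Since every accepted step strictly increases the score over a finite state space, the loop terminates at a local optimum of the model score.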

Setup
We evaluate our approach on Chinese-English alignment and translation tasks.
The training corpus consists of 1.2M sentence pairs with 32M Chinese words and 35.4M English words. We used the SRILM toolkit (Stolcke, 2002) to train a 4-gram language model on the Xinhua portion of the English GIGAWORD corpus, which contains 398.6M words. For alignment evaluation, we used the Tsinghua Chinese-English word alignment evaluation data set. The evaluation metric is alignment error rate (AER) (Och and Ney, 2003). For translation evaluation, we used the NIST 2006 dataset as the development set and the NIST 2002, 2003, 2004, 2005, and 2008 datasets as the test sets. The evaluation metric is case-insensitive BLEU (Papineni et al., 2002).
We used both phrase-based (Koehn et al., 2003) and hierarchical phrase-based (Chiang, 2007) translation systems to evaluate whether our approach improves translation performance. For the phrase-based model, we used the open-source toolkit Moses (Koehn and Hoang, 2007). For the hierarchical phrase-based model, we used an in-house re-implementation on par with state-of-the-art open-source decoders.
For GIZA++, we trained IBM Model 4 in two directions with the default setting and used the grow-diag-final heuristic to generate symmetric alignments. For BERKELEY, we trained joint HMMs using the default setting. The hyperparameter of posterior decoding was optimized on the development set.
We used first-order HMMs for both word alignment and phrase segmentation. Our joint alignment and segmentation model was trained using the Viterbi EM algorithm for five iterations. Note that the Chinese-to-English and English-to-Chinese alignments are generally non-identical but share many links (see Figure 1(c)). Then, we used the grow-diag-final heuristic to generate symmetric alignments.

Table 1: Comparison with GIZA++ and BERKELEY. "word-word" denotes the agreement between Chinese-to-English and English-to-Chinese word alignments. "word-phrase" denotes the agreement between word alignments and phrase segmentations. "HM" denotes the hard matching loss function, "SM" denotes soft matching, and "SV" denotes segmentation violation. "GDF" denotes grow-diag-final. "PD" denotes posterior decoding. The BLEU scores are evaluated on the NIST08 test set.

Table 2: Results on (hierarchical) phrase-based translation. The evaluation metric is case-insensitive BLEU. "HM" denotes the hard matching loss function, "SM" denotes soft matching, and "SV" denotes segmentation violation. "*": significantly better than GIZA++ (p < 0.05). "**": significantly better than GIZA++ (p < 0.01). "+": significantly better than BERKELEY (p < 0.05). "++": significantly better than BERKELEY (p < 0.01).

Comparison with GIZA++ and BERKELEY
BERKELEY trains two models jointly with the hard matching (i.e., HM) loss function and uses posterior decoding for symmetrization. For our approach, we distinguish between two variants:

1. Imposing agreement between word alignments (i.e., word-word) using the soft matching loss function (i.e., SM) (see Section 3.1);

2. Imposing agreement between word alignments and phrase segmentations (i.e., word-word, word-phrase) using both the soft matching and segmentation violation loss functions (i.e., SM+SV) (see Section 3.2).
We used the grow-diag-final heuristic for symmetrization.
For the alignment evaluation, we find that our approach achieves higher AER scores than the two baseline systems. One possible reason is that links in the intersection of the two asymmetric alignments (i.e., links on which the two models agree) usually correspond to sure links in the gold-standard annotation. Our approach loosens the hard constraint on agreement and makes the posteriors less peaked around the Viterbi alignments.
For the translation evaluation, we used the phrase-based system Moses to report BLEU scores on the NIST 2008 test set. We find that both variants of our approach significantly outperform the two baselines (p < 0.01).

Table 2 shows the results on phrase-based and hierarchical phrase-based translation systems. We find that our approach systematically outperforms GIZA++ and BERKELEY on all NIST datasets. In particular, generalizing the agreement to model the discrepancy between word alignment and phrase segmentation is consistently beneficial for improving translation quality, suggesting that it is important to introduce structural constraints into word alignment to increase the correlation between alignment and translation.

Results on (Hierarchical) Phrase-based Translation
While "SM+SV" improves over "SM" significantly on phrase-based translation, the margins on the hierarchical phrase-based system are relatively smaller. One possible reason is that the "SV" loss function can better account for phrase-based rather than hierarchical phrase-based translation. It is possible to design new loss functions tailored to hierarchical phrase-based translation.

We also find that the BLEU scores of BERKELEY on hierarchical phrase-based translation are much lower than those on phrase-based translation. This might result from the fact that BERKELEY is prone to produce one-to-one alignments, which are not optimal for hierarchical phrase-based translation.

Table 3: Agreement evaluation of GIZA++, BERKELEY, and our approach. The F1 score reflects how well two asymmetric alignments agree with each other.

Table 3 compares how well the two asymmetric models agree with each other for GIZA++, BERKELEY, and our approach. We use the F1 score to measure the degree of agreement:

Agreement Evaluation
F1 = 2 |A_{C→E} ∩ A_{E→C}| / (|A_{C→E}| + |A_{E→C}|)

where A_{C→E} is the set of Chinese-to-English alignments on the training data and A_{E→C} is the set of English-to-Chinese alignments. It is clear that independent training leads to low agreement while joint training results in high agreement. BERKELEY achieves the highest value of agreement because of the hard constraint.
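Computed over link sets, this agreement F1 is simply the harmonic mean of each direction's overlap with the other; a minimal sketch:

```python
def agreement_f1(a_ce, a_ec):
    """F1 between two alignment link sets: 2|A ∩ B| / (|A| + |B|)."""
    if not a_ce or not a_ec:
        return 0.0
    inter = len(a_ce & a_ec)
    return 2.0 * inter / (len(a_ce) + len(a_ec))
```

Identical link sets score 1.0; disjoint link sets score 0.0, so the metric directly quantifies how far joint training pushes the two directions toward consensus.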

Related Work
This work is inspired by two lines of research: (1) agreement-based learning and (2) joint modeling of multiple NLP tasks.

Agreement-based Learning
The key idea of agreement-based learning is to train a set of models jointly by encouraging them to agree on the hidden variables (Liang et al., 2006; Liang et al., 2008). This can also be seen as a particular form of posterior constraint or posterior regularization (Graça et al., 2007; Ganchev et al., 2010). Agreement acts as prior knowledge and indirect supervision, which helps to guide training toward a more reasonable model.
While agreement-based learning provides a principled approach to training a generative model, it requires that the sub-models share the same output space. Our work extends Liang et al. (2006) by introducing arbitrary loss functions that can encode prior knowledge. As a result, Liang et al. (2006)'s model is a special case of our framework. Another difference is that our framework allows for including the agreement between word alignment and other structures such as phrase segmentations and parse trees.

Joint Modeling of Multiple NLP Tasks
It is well accepted that different NLP tasks can help each other by providing additional information for resolving ambiguities. As a result, joint modeling of multiple NLP tasks has received intensive attention in recent years, including phrase segmentation and alignment (Zhang et al., 2003), alignment and parsing (Burkett et al., 2010), tokenization and translation (Xiao et al., 2010), parsing and translation, and alignment and named entity recognition (Chen et al., 2010; Wang et al., 2013).
Among them, Zhang et al. (2003)'s integrated search algorithm for phrase segmentation and alignment is closest to our work. They use point-wise mutual information to identify possible phrase pairs. The major difference is that we train the models jointly rather than only integrating them at decoding time.

Conclusion
We have presented generalized agreement for bidirectional word alignment. The loss functions can be defined both between asymmetric alignments and between alignments and other latent structures such as phrase segmentations. We develop a Viterbi EM algorithm to train the joint model. Experiments on Chinese-English translation show that joint training with generalized agreement achieves significant improvements over two baselines for (hierarchical) phrase-based MT systems. In the future, we plan to investigate more loss functions to account for syntactic constraints.