Towards a Convex HMM Surrogate for Word Alignment



Introduction
The IBM translation models are widely used in modern statistical translation systems. Typically, one seeds more complex models with simpler models, and the parameters of each model are estimated through an Expectation Maximization (EM) procedure. Among the IBM Models, perhaps the most elegant is the HMM model (Vogel et al., 1996). The HMM is the last model whose expectation step is both exact and simple, and it attains a level of accuracy that is very close to the results achieved by much more complex models. In particular, experiments have shown that IBM Models 1, 2, and 3 all perform worse than the HMM, and that Model 4 benefits greatly from being seeded by the HMM (Och and Ney, 2003).
In this paper we make the following contributions:
• We derive a new alignment model which combines the structure of the HMM and IBM Model 2 and show that its performance is very close to that of the HMM. There are several reasons why such a result would be of value (for more on this, see (Simion et al., 2015a), for example).
• The main goal of this work is not to eliminate highly non-convex models such as the HMM entirely but, rather, to develop a new, powerful, convex alignment model and thus push the boundary of these theoretically justified techniques further. Building on the work of (Simion et al., 2015a), we derive a convex relaxation for the new model and show that its performance is close to that of the HMM. Although it does not beat the HMM, the new convex model improves upon the standard IBM Model 2 significantly. Moreover, the convex relaxation also performs better than the strong IBM 2 variant FastAlign (Dyer et al., 2013), IBM Model 3, and the other available convex alignment models detailed in (Simion et al., 2015a).
• We derive a parameter estimation algorithm for the new model and its convex relaxation based on the EM algorithm. Our model has both HMM emission probabilities and IBM Model 2's distortions, so we can use Model 2 to seed both the model's lexical and distortion parameters. For the convex model, we need not use any initialization heuristics since the EM algorithm we derive is guaranteed to converge to a local optimum that is also global.
The goal of our work is to present a model which is convex and has state-of-the-art empirical performance. Although one step of this task was achieved for IBM Model 2 (Simion et al., 2015a), our target here is a much more local-optima-laden, non-convex objective. Finally, whereas IBM Model 2 in some ways admits a clear method of attack, we will discuss why the HMM presents challenges that require the introduction of this new surrogate.
Notation. We adopt the notation introduced in (Och and Ney, 2003) of having $1^m2^n$ denote the training scheme of $m$ IBM Model 1 EM iterations followed by initializing Model 2 with these parameters and running $n$ IBM Model 2 EM iterations. We denote the HMM by H and note that it too can be seeded by running Model 1 followed by Model 2. Additionally, we denote our model by $2_H$, and note that it has distortion parameters like IBM Model 2 and emission parameters like those of the HMM. Under this notation, we let $1^m2^n2_H^o$ denote running Model 1 for $m$ iterations, then Model 2 for $n$ iterations, and then finally our model for $o$ iterations. As before, we are seeding from the more basic to the more complex model in turn. We denote the convex relaxation of $2_H$ by $2_{HC}$. Throughout this paper, for any integer $N$, we use $[N]$ to denote $\{1, \ldots, N\}$ and $[N]_0$ to denote $\{0, \ldots, N\}$. Finally, in our presentation, "convex function" means a function for which a local maximum is also global, for example, $f(x) = -x^2$.

IBM Models 1 and 2 and the HMM
In this section we give a brief review of IBM Models 1, 2, and the HMM, as well as the optimization problems arising from these models. The standard approach for optimization within these latent variable models is the EM algorithm.
Throughout this section, and the remainder of the paper, we assume that our set of training examples is $(e^{(k)}, f^{(k)})$ for $k = 1 \ldots n$, where $e^{(k)}$ is the $k$'th English sentence and $f^{(k)}$ is the $k$'th French sentence. Following standard convention, we assume the task is to translate from French (the "source" language) into English (the "target" language). We use $E$ to denote the English vocabulary (set of possible English words), and $F$ to denote the French vocabulary. The $k$'th English sentence is a sequence of words $e^{(k)}_1 \ldots e^{(k)}_{l_k}$, where $l_k$ is the length of the $k$'th English sentence and each $e^{(k)}_i \in E$; similarly, the $k$'th French sentence is a sequence $f^{(k)}_1 \ldots f^{(k)}_{m_k}$, where $m_k$ is the length of the $k$'th French sentence and each $f^{(k)}_j \in F$. We define $e^{(k)}_0$ for $k = 1 \ldots n$ to be a special NULL word (note that $E$ contains the NULL word).
For each English word e ∈ E, we will assume that D(e) is a dictionary specifying the set of possible French words that can be translations of e. The set D(e) is a subset of F . In practice, D(e) can be derived in various ways; in our experiments we simply define D(e) to include all French words f such that e and f are seen in a translation pair.
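As an illustration, here is a minimal sketch of building such a dictionary from a tokenized bitext (the variable names are ours; the NULL word is included so that every French word is a candidate translation of NULL):

```python
from collections import defaultdict

def build_dictionary(bitext):
    """D(e): the set of French words co-occurring with English word e
    (including the NULL word) in some training pair."""
    D = defaultdict(set)
    for english, french in bitext:
        for e in ["<NULL>"] + english:
            for f in french:
                D[e].add(f)
    return D

# Toy usage:
bitext = [(["the", "house"], ["la", "maison"]),
          (["the", "book"], ["le", "livre"])]
D = build_dictionary(bitext)
```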
Given these definitions, the IBM Model 2 optimization problem is presented in several sources. The parameters in this problem are $t(f|e)$ and $d(i|j,l,m)$. The objective function for IBM Model 2 is then the log-likelihood of the training data; we can simplify the log-likelihood (Koehn, 2008) as
$$\sum_{k=1}^{n} \sum_{j=1}^{m_k} \log \sum_{i=0}^{l_k} t(f^{(k)}_j|e^{(k)}_i)\, d(i|j,l_k,m_k)~.$$
This last simplification is crucial as it allows for a simple multinomial EM implementation, and can be done for IBM Model 1 as well (Koehn, 2008). Furthermore, the ability to write out the marginal likelihood per sentence in this manner has seen other applications: it was crucial, for example, in deriving a convex relaxation of IBM Model 2 and solving the new problem using subgradient methods.
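To make the factorization concrete, the following minimal sketch computes the per-sentence Model 2 log-likelihood in the simplified form above; `t` and `d` are assumed dictionaries holding $t(f|e)$ and $d(i|j,l,m)$, and position 0 of each English sentence is the NULL word.

```python
import math

def model2_log_likelihood(e_sent, f_sent, t, d):
    """log p(f|e) under IBM Model 2: the sum over alignments factorizes
    across French positions, so no dynamic programming is needed."""
    l, m = len(e_sent) - 1, len(f_sent)   # e_sent[0] is the NULL word
    total = 0.0
    for j, f in enumerate(f_sent, start=1):
        total += math.log(sum(t[(f, e_sent[i])] * d[(i, j, l, m)]
                              for i in range(l + 1)))
    return total
```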
An improvement on IBM Model 2, called the HMM alignment model, was introduced by Vogel et al. (Vogel et al., 1996). For this model, the distortion parameters are replaced by emission parameters $d(a_j|a_{j-1}, l)$. These emission parameters specify the probability that the alignment variable for the $j$'th source word is $a_j$, given that the previous source word was aligned to the target word at position $a_{j-1}$ in a target sentence of length $l$. The objective of the HMM is given by
$$\sum_{k=1}^{n} \log \sum_{a_1=0}^{l_k} \cdots \sum_{a_{m_k}=0}^{l_k} \prod_{j=1}^{m_k} t(f^{(k)}_j|e^{(k)}_{a_j})\, d(a_j|a_{j-1}, l_k)~,$$
and we present this in Fig 1. We note that, unlike IBM Model 2, we cannot simplify the exponential sum within the log-likelihood of the HMM, and so EM training for this model requires the use of a special EM implementation known as the Baum-Welch algorithm (Rabiner and Juang, 1986).
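The sum over alignments in the HMM objective can still be computed exactly with the forward recursion that underlies Baum-Welch; the sketch below is our own illustration (not the GIZA++ implementation), ignores the NULL word for simplicity, and assumes an initial distribution `init` over the first alignment.

```python
import math

def hmm_log_likelihood(e_sent, f_sent, t, d, init):
    """log p(f|e) under the HMM alignment model via the forward algorithm.
    t[(f, e)] are lexical probabilities, d[(i, i_prev, l)] transition
    probabilities, and init[i] the distribution of the first alignment."""
    l, m = len(e_sent), len(f_sent)
    # alpha[i] = p(f_1 .. f_j, a_j = i | e) after processing position j.
    alpha = [init[i] * t[(f_sent[0], e_sent[i])] for i in range(l)]
    for j in range(1, m):
        alpha = [t[(f_sent[j], e_sent[i])] *
                 sum(alpha[ip] * d[(i, ip, l)] for ip in range(l))
                 for i in range(l)]
    return math.log(sum(alpha))
```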
Once these models are trained, each model's highest probability (Viterbi) alignment is computed. For IBM Models 1 and 2, the Viterbi alignment decomposes easily across source positions (Koehn, 2008). For the HMM, dynamic programming is used (Vogel et al., 1996). Although it is non-convex and thus its initialization is important, the HMM is the last alignment model in the classical setting that has an exact EM procedure (Och and Ney, 2003): from IBM Model 3 onwards, heuristics are used within the expectation and maximization steps of each model's associated EM procedure.

Distortion and emission parameter structure
The structure of IBM Model 2's distortion parameters and the HMM's emission parameters is important and is used in our model as well, so we detail it here. We use roughly the same structure as (Liang et al., 2006) and (Dyer et al., 2013): the distortions and emissions of our model are parametrized so as to concentrate the model's alignments on the diagonal.

Distortion Parameters for IBM2
Let $\lambda > 0$. For the IBM Model 2 distortions we set the NULL word probability to $d(0|j,l,m) = p_0$, where $p_0 = \frac{1}{l+1}$; note that this will generally depend on the target sentence length within the bitext training pair under consideration. For $i \neq 0$ we set
$$d(i|j,l,m) = \frac{e^{-\lambda\left|\frac{i}{l} - \frac{j}{m}\right|}}{Z_\lambda(j,l,m)}~,$$
where $Z_\lambda(j,l,m)$ is a normalization constant, as in (Dyer et al., 2013).
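For concreteness, the following sketch computes this distortion distribution (assuming, as in FastAlign, that the non-NULL mass $1 - p_0$ is spread over $i = 1 \ldots l$ in proportion to the diagonal-favoring weights):

```python
import math

def distortion(lmbda, j, l, m):
    """Return d(i|j, l, m) for i = 0..l, where i = 0 is the NULL position.
    Non-NULL positions get weight exp(-lambda * |i/l - j/m|)."""
    p0 = 1.0 / (l + 1)
    weights = [math.exp(-lmbda * abs(i / l - j / m)) for i in range(1, l + 1)]
    Z = sum(weights)
    return [p0] + [(1.0 - p0) * w / Z for w in weights]
```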

Emission Parameters for HMM
Let $\theta > 0$. For the HMM emissions we first set the NULL word generation to $d(0|i,l) = p_0$, with $p_0 = \frac{1}{l+1}$. For $i \neq 0$ and $i' \neq 0$ we set
$$d(i'|i,l) = \frac{e^{-\theta\left|i' - i - 1\right|}}{Z_\theta(i,l)}~,$$
where $Z_\theta(i,l)$ is a suitable normalization constant. Lastly, if $i = 0$, so that we are jumping from the NULL word onto a possibly different word, we set $d(i'|0,l) = p_0$. Aside from making the NULL word have a uniform jump probability, the above emission parameters are modeled to favor a jump to an adjacent English word.
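A small sketch of one concrete instantiation of these emissions (the exact exponent is our assumption; any form concentrating mass near the adjacent position $i+1$ fits the description above):

```python
import math

def emission(theta, i, l):
    """Return d(i'|i, l) for i' = 0..l, where i' = 0 is the NULL position.
    Jumps from NULL (i = 0) are uniform; otherwise non-NULL jumps prefer the
    adjacent position i + 1 via exp(-theta * |i' - (i + 1)|)."""
    p0 = 1.0 / (l + 1)
    if i == 0:
        return [p0] * (l + 1)
    weights = [math.exp(-theta * abs(ip - (i + 1))) for ip in range(1, l + 1)]
    Z = sum(weights)
    # The non-NULL mass (1 - p0) is spread proportionally to the weights.
    return [p0] + [(1.0 - p0) * w / Z for w in weights]
```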

Combining IBM Model 2 and the HMM
In deriving the new HMM surrogate, our main goal was to allow the current alignment to know as much as possible about the previous alignment variable while still having a likelihood that factors like that of IBM Model 2 (Koehn, 2008). We combine IBM Model 2 and the HMM by incorporating the generation of words using the structure of both models. The model we introduce, IBM2-HMM, is displayed in Fig 2. Consider a target-source sentence pair $(e, f)$ with $|e| = l$ and $|f| = m$. For source sentence positions $j$ and $j+1$ we have source words $f_j$ and $f_{j+1}$, and we assign a joint probability involving the alignments $a_j$ and $a_{j+1}$ as:
$$p(f_j, f_{j+1}, a_j, a_{j+1}|e) = t(f_j|e_{a_j})\, d(a_j|j, l, m)\; t(f_{j+1}|e_{a_{j+1}})\, d(a_{j+1}|a_j, l)~.$$
From the equation above, we note that we use IBM Model 2's word generation method for position $j$ and the HMM's generative structure for position $j+1$. The generative nature of the above procedure introduces dependency between adjacent words two at a time. Since we want to mimic the HMM's structure as much as possible, we devise our likelihood function to mimic the HMM's dependency between alignments using $q$. Essentially, we move the source word position $j$ from 1 to $m$ and allow for overlapping terms when $j \in \{2, \ldots, m-1\}$. In what follows, we describe this representation in detail.

The likelihood in Eq. 16 is actually the sum of two likelihoods which use Eqs. 5 and 6 repeatedly. To this end, we will discuss how our objective is actually based on the joint probability $p(f, a, b|e) = p_1(f, a|e)\, p_2(f, b|e)$, where $a^{(k)}$ and $b^{(k)}$ are both alignment vectors whose components are independent and can take on any values in $[l_k]_0$. To see how $p(f, a, b|e)$ comes about, note that we could generate the sentence $f$ by generating the pairs $(1, 2), (3, 4), (5, 6), \ldots$ using Eqs. 5 and 6 for each pair. Taking all this together, the upshot of our discussion is that generating the pair $(e, f)$ in this way gives us that the likelihood for an alignment $a$ would be given by:
$$p_1(f, a|e) = \prod_{\substack{j \in \{1, 3, 5, \ldots\} \\ j+1 \le m}} t(f_j|e_{a_j})\, d(a_j|j, l, m)\; t(f_{j+1}|e_{a_{j+1}})\, d(a_{j+1}|a_j, l)~.$$
Using the same idea as above, we could also skip the first source word position and generate the pairs $(2, 3), (4, 5), \ldots$ using Eqs. 5 and 6. Under this second generative method, the joint probability for $f$ and an alignment $b$ is:
$$p_2(f, b|e) = \prod_{\substack{j \in \{2, 4, 6, \ldots\} \\ j+1 \le m}} t(f_j|e_{b_j})\, d(b_j|j, l, m)\; t(f_{j+1}|e_{b_{j+1}})\, d(b_{j+1}|b_j, l)~.$$
Finally, we note that if $m$ is even we do not generate $f_1$ and $f_m$ under $p_2$, but we do generate these words under $p_1$. Similarly, if $m$ is odd we do not generate $f_1$ under $p_2$ and we do not generate $f_m$ under $p_1$; however, in this case as in the first, we still generate these missing words under the other generative method. Using $p(f, a, b|e) = p_1(f, a|e)\, p_2(f, b|e)$ and factoring the log-likelihood as in IBM Models 1 and 2 (Koehn, 2008), we get the log-likelihood in Fig 2. Finally, we note that our model's log-likelihood can be viewed as the sum of the log-likelihoods of a model which generates $(e, f)$ using $p_1$ and another model which generates sentences using $p_2$. These models share parameters but generate words using different recipes, as discussed above.
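To make the two overlapping generative passes concrete, the sketch below (our own illustration; `t`, `d2`, and `dH` are assumed dictionary lookups for $t(f|e)$, $d(i|j,l,m)$, and $d(i'|i,l)$, and sentences are 0-indexed with $e[0]$ the NULL word) computes $p_1(f,a|e)$, $p_2(f,b|e)$, and their product for fixed alignment vectors.

```python
def pair_prob(j, i, i_next, e, f, l, m, t, d2, dH):
    """Probability of the pair (f_j, f_{j+1}): Model 2 style generation for
    position j, HMM style generation for position j+1."""
    return (t[(f[j], e[i])] * d2[(i, j + 1, l, m)] *
            t[(f[j + 1], e[i_next])] * dH[(i_next, i, l)])

def p1(e, f, a, l, m, t, d2, dH):
    """Generate f in pairs (1,2), (3,4), ... under alignment a."""
    prob = 1.0
    for j in range(0, m - 1, 2):
        prob *= pair_prob(j, a[j], a[j + 1], e, f, l, m, t, d2, dH)
    return prob

def p2(e, f, b, l, m, t, d2, dH):
    """Generate f in pairs (2,3), (4,5), ... under alignment b."""
    prob = 1.0
    for j in range(1, m - 1, 2):
        prob *= pair_prob(j, b[j], b[j + 1], e, f, l, m, t, d2, dH)
    return prob

# The model's joint probability is then p(f, a, b | e) = p1(...) * p2(...).
```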

The parameter estimation for IBM2-HMM
Figure 2: The IBM2-HMM optimization problem; we use equation (5) within the likelihood definition. Input: $E$, $F$, $(e^{(k)}, f^{(k)}, l_k, m_k)$ for $k = 1 \ldots n$, and $D(e)$ for $e \in E$ as in Section 2. Parameters: a parameter $t(f|e)$ for each $e \in E$, $f \in D(e)$, a parameter $\lambda > 0$ for distortion centering, and a parameter $\theta > 0$ for emission centering. Objective: maximize the log-likelihood with respect to the parameters $t(f|e)$, with $d(i'|i,l)$, $d(i|j,l,m)$, and $q(j,i,i',l_k,m_k)$ set as in Section 3.

To fully optimize our new model (over $t$, $\lambda$, and $\theta$), we can use an EM algorithm in the same fashion as (Dyer et al., 2013). Specifically, for the model in question the EM algorithm still applies, but we have to use a gradient-based algorithm within the learning step. On the other hand, since such a gradient-based method introduces the added complication of a learning rate, we could also optimize the objective by picking $\theta$ and $\lambda$ via cross-validation and using a multinomial EM algorithm to learn the lexical $t$ terms. For our experiments, we opted for this simpler choice: we derived a multinomial EM algorithm and cross-validated the centering parameters for the distortion and emission terms. With $\lambda$ and $\theta$ fixed, the derivation of this algorithm is very similar to the one used for IBM2-HMM's convex relaxation and follows the path discussed in (Simion et al., 2015a) and (Simion et al., 2015b). We detail the EM algorithm for the convex relaxation below.
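As an illustration of this simpler recipe, the following hedged sketch grid-searches $\lambda$ and $\theta$ on a held-out set while the lexical parameters are fit by multinomial EM with the distortions and emissions held fixed; `train_em` and `dev_score` stand in for routines described in this paper (EM over the $t$ terms and, say, AER on development data) and are not fixed names.

```python
import itertools

def tune_centering(train_em, dev_score, lambdas=(2, 4, 6, 8), thetas=(2, 4, 6, 8)):
    """Cross-validate the centering parameters (lambda, theta).

    train_em(lmbda, theta): run multinomial EM for t(f|e) with the distortions
    and emissions fixed by (lmbda, theta), returning the learned t.
    dev_score(t, lmbda, theta): a held-out score to minimize, e.g. AER."""
    best = None
    for lmbda, theta in itertools.product(lambdas, thetas):
        t = train_em(lmbda, theta)
        score = dev_score(t, lmbda, theta)
        if best is None or score < best[0]:
            best = (score, lmbda, theta, t)
    return best
```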

A Convex HMM Surrogate
In this section we detail a procedure to get a convex relaxation for IBM2-HMM. Let $(t, d)$ be all the parameters of the HMM. As a first step in getting a convex HMM, one could follow the path developed in (Simion et al., 2015a) and directly replace each of the HMM's objective terms
$$\prod_{j=1}^{m_k} t(f^{(k)}_j|e^{(k)}_{a_j})\, d(a_j|a_{j-1}, l_k) \qquad\text{by}\qquad \Bigl(\prod_{j=1}^{m_k} t(f^{(k)}_j|e^{(k)}_{a_j})\, d(a_j|a_{j-1}, l_k)\Bigr)^{\frac{1}{2m_k}}~.$$
In particular, the geometric mean function $h(x_1, \ldots, x_N) = (\prod_{i=1}^{N} x_i)^{\frac{1}{N}}$ is concave, and hence "convex" in our sense (Boyd and Vandenberghe, 2004), and, for a given sentence pair $(e^{(k)}, f^{(k)})$ with alignment $a^{(k)}$, we can find a projection matrix $P$ so that $P(t, d) = (\tilde t, \tilde d)$ are exactly the parameters used in the term above (in particular, $t, d$ are the set of all parameters while $\tilde t, \tilde d$ are the set of parameters for the specific training pair $k$; $P$ projects from the full space onto only the parameters used for training pair $k$). Given this, we then have that $g(t, d) = h(P(t, d)) = h(\tilde t, \tilde d)$ is convex and, by composition, so is $\log g(t, d)$ (see (Simion et al., 2015a; Boyd and Vandenberghe, 2004) for details; the main idea lies in the fact that linear transformations preserve convexity, as do compositions of convex functions with increasing convex functions such as $\log$). Finally, if we run this plan for all terms in the objective, the new objective is convex since it is the sum of convex functions (the new optimization problem is convex as it has linear constraints). Although this gives a convex program, we observed that the powers being so small made the optimized probabilities very uninformative (i.e. uniform). The above makes sense: no matter what the parameters are, each term in the objective is easily pushed towards the 1 we seek, since all terms are taken to a low ($\frac{1}{2m_k}$) power.

Since this direct relaxation does not bear fruit, we next turn to our model. Developing its relaxation in the vein of (Simion et al., 2015a), one route would be to let $d(i|j,l,m)$ and $d(i'|i,l)$ be multinomial probabilities (that is, these parameters would not have centering parameters $\lambda$ and $\theta$ and would be just standard probabilities, as in the GIZA++ versions of the HMM and IBM Model 2 (Och and Ney, 2003)) and replace all the terms $q(j,i,i',l,m)$ in (16) by $(q(j,i,i',l,m))^{\frac{1}{4}}$. Although this method is feasible, experiments showed that the relaxation is not very competitive and performs on par with IBM Model 2; this relaxation is far in performance from the HMM even though we are relaxing (only) a product of 4 terms. (Lastly, we mention that we tried other variants where we replaced $d(i|j,l,m)\,d(i'|i,l)$ by $d(i,i'|j,l,m)$ so that we would have only three terms; unfortunately, this last attempt also produced parameters that were "too uniform".)
The above analysis motivates why we defined our model as we did: we now have only two terms to relax. In particular, to rectify the above, we left in place the structure discussed in Section 3 and made $\lambda$ and $\theta$ tuning parameters which we cross-validate on a small held-out data set. This last constraint effectively removes the distortion and emission parameters from the model, but we still maintain their structural property: they favor the diagonal or adjacent alignments. To get the relaxation, we replace $q(j,i,i',l,m)$ by
$$p(j,i,i',l,m) \propto \bigl(t(f_j|e_i)\, t(f_{j+1}|e_{i'})\bigr)^{\frac{1}{2}}$$
and set the proportionality constant to be $d(i|j,l,m)\,d(i'|i,l)$. Using this setup we now have a convex objective to optimize over. In particular, we have formulated a convex relaxation of the IBM2-HMM problem which, like the Support Vector Machine, includes parameters that can be cross-validated over (Boyd and Vandenberghe, 2004).
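For concreteness, the sketch below evaluates the relaxed per-pair term under our reading of the construction above: once $\lambda$ and $\theta$ are fixed, the distortion and emission factors are constants and only the geometric mean of the two lexical probabilities remains to be learned (the lookups `t`, `d2`, `dH` are as in the earlier sketches).

```python
import math

def relaxed_pair_term(j, i, i_next, e, f, l, m, t, d2, dH):
    """Relaxed analogue of the IBM2-HMM pair term: the two t's enter through
    their geometric mean, while d2 (distortion) and dH (emission) are constants."""
    const = d2[(i, j + 1, l, m)] * dH[(i_next, i, l)]
    return const * math.sqrt(t[(f[j], e[i])] * t[(f[j + 1], e[i_next])])
```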

Constraints: $\forall e \in E, f \in D(e)$: $t(f|e) \geq 0$ (18). Objective: Maximize the relaxed log-likelihood with respect to the parameters $t(f|e)$, with $p(j, i, i', l_k, m_k)$ set as above.

An EM algorithm for the convex surrogate
The EM algorithm for the convex relaxation of our surrogate is given in Fig 4. As the model's objective is the sum of the objectives of two models generated by a multinomial rule, we can get a very succinct EM algorithm. For more details on this and a similar derivation, please refer to (Simion et al., 2015a), (Koehn, 2008) or (Simion et al., 2015b). For this algorithm, we again note that the distortion and emission parameters are constants so that the only estimation that needs to be conducted is on the lexical t terms.
To be specific, we have that the M step requires optimizing
$$\sum_{k=1}^{n} \sum_{j} \sum_{i=0}^{l_k} \sum_{i'=0}^{l_k} \delta^{(k)}_{j}[i, i'] \, \log \bigl(t(f^{(k)}_j|e^{(k)}_i)\, t(f^{(k)}_{j+1}|e^{(k)}_{i'})\bigr)^{\beta}~,$$
with $\beta = \frac{1}{2}$. In the above, the $\delta^{(k)}_{j}[i, i']$ are constants proportional to the posterior weight that the relaxed objective places on aligning positions $j$ and $j+1$ to $i$ and $i'$, and they are gotten through the E step. This optimization step is very similar to the regular Model 2 M step since the $\beta$ drops down using $\log t^{\beta} = \beta \log t$; the exact same count-based method can be applied. The upshot of this is given in Fig 4; similar to the logic above for $2_{HC}$, we can get the EM algorithm for $2_H$.
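Concretely, once the expected counts have been accumulated in the E step, the M step is the familiar normalized-count update for $t(f|e)$; a minimal sketch (the count container name is our own) is given below.

```python
from collections import defaultdict

def m_step(expected_counts):
    """Multinomial M step: t(f|e) = counts(e, f) / sum over f' of counts(e, f')."""
    totals = defaultdict(float)
    for (f, e), c in expected_counts.items():
        totals[e] += c
    return {(f, e): c / totals[e] for (f, e), c in expected_counts.items()}
```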

Decoding methods for IBM2-HMM
When computing the optimal alignment we wanted to compare our model with the HMM as closely as possible. Because of this, the most natural method of evaluating the quality of the parameters is to use the same decoding rule as the HMM. Specifically, for a sentence pair $(e, f)$ with $|e| = l$ and $|f| = m$, in HMM decoding we aim to find the alignment $(a_1 \ldots a_m)$ attaining
$$\max_{a_1, \ldots, a_m} \prod_{j=1}^{m} t(f_j|e_{a_j})\, d(a_j|a_{j-1}, l)~.$$
As is standard, dynamic programming can now be used to find the Viterbi alignment. Although there are a number of ways we could define the optimal alignment, we felt that the above would be the best since it tests dependence between alignment variables and allows for easy comparison with the GIZA++ HMM. Decoding under the HMM rule is labelled "HMM" in Table 1. We can also find the optimal alignment by taking the objective literally (see (Simion et al., 2014) for a similar argument dealing with the convex relaxation of IBM Model 2) and computing
$$\max_{a_1, \ldots, a_m} p_1(f, a|e)\, p_2(f, a|e)~.$$
Above, we are asking for the single alignment that has the highest probability under both generating techniques $p_1$ and $p_2$. This method of decoding is a lot like the HMM style and also relies on dynamic programming. In this case we have the recursion for $Q_{\mathrm{Joint}}$ given by
$$Q_{\mathrm{Joint}}(j, i) = \max_{i'} \; Q_{\mathrm{Joint}}(j-1, i')\; t(f_{j-1}|e_{i'})\, d(i'|j-1, l, m)\; t(f_j|e_i)\, d(i|i', l)~,$$
for $j = 2, \ldots, m$, with $Q_{\mathrm{Joint}}(1, i) = 1$. The alignment results gotten by decoding with this method are labelled "Joint" in Table 1.
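A sketch of the joint decoder implementing the recursion above (0-based indexing, back-pointers to recover the argmax; this is our own rendering, not the GIZA++ code):

```python
def joint_decode(e, f, l, m, t, d2, dH):
    """Viterbi-style DP for max_a p1(f, a|e) * p2(f, a|e).
    e has l + 1 entries with e[0] the NULL word; f has m entries."""
    Q = [1.0] * (l + 1)          # Q[i] = best score of a prefix ending in a_j = i
    back = []
    for j in range(1, m):        # 0-based position j pairs with position j - 1
        newQ, ptr = [], []
        for i in range(l + 1):
            best_ip, best_score = 0, -1.0
            for ip in range(l + 1):
                s = (Q[ip] * t[(f[j - 1], e[ip])] * d2[(ip, j, l, m)] *
                     t[(f[j], e[i])] * dH[(i, ip, l)])
                if s > best_score:
                    best_ip, best_score = ip, s
            newQ.append(best_score)
            ptr.append(best_ip)
        Q, back = newQ, back + [ptr]
    # Follow back-pointers from the best final state.
    a = [max(range(l + 1), key=lambda i: Q[i])]
    for ptr in reversed(back):
        a.append(ptr[a[-1]])
    return list(reversed(a))
```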

Experiments
In this section we describe experiments using the IBM2-HMM optimization problem combined with the EM algorithm for parameter estimation.

Data Sets
We use data from the bilingual word alignment workshop held at HLT-NAACL 2003 (Mihalcea and Pedersen, 2003). We use the Canadian Hansards bilingual corpus, with 743,989 English-French sentence pairs as training data, 37 sentences of development data, and 447 sentences of test data (note that we use a randomly chosen subset of the original training set of 1.1 million sentences, similar to the setting used in (Moore, 2004)). The development and test data have been manually aligned at the word level, annotating alignments between source and target words in the corpus as either "sure" (S) or "possible" (P) alignments, as described in (Och and Ney, 2003). As is standard, we lower-cased all words before giving the data to GIZA++ and we ignored NULL word alignments in our computation of alignment quality scores.

Methodology
We test several models in our experiments. In particular, we empirically evaluate our models against the GIZA++ IBM Model 3 and HMM, as well as the FastAlign IBM Model 2 implementation of (Dyer et al., 2013) that uses Variational Bayes. For each of our models, we estimated parameters and obtained alignments in both the source-target and target-source directions; following standard practice, we present the intersected alignments. In training, we employ the standard practice of initializing non-convex alignment models with simpler non-convex models. In particular, we initialize the GIZA++ HMM with IBM Model 2, IBM Model 2 with IBM Model 1, and IBM2-HMM and IBM Model 3 with IBM Model 2 preceded by Model 1. Lastly, for FastAlign, we initialized all parameters uniformly since this empirically was a more favorable initialization, as discussed in (Dyer et al., 2013).
We measure the performance of the models in terms of Precision, Recall, F-Measure, and AER, using only sure alignments in the definitions of the first three metrics and both sure and possible alignments in the definition of AER, as in (Marcu et al., 2006). For our experiments, we report results in both AER (lower is better) and F-Measure (higher is better) (Och and Ney, 2003).

Table 1 shows the alignment summary statistics for the 447 sentences present in the Hansard test data. We present alignment quality scores for the FastAlign IBM Model 2, the GIZA++ HMM, and our model and its relaxation using either "HMM" or "Joint" decoding. First, we note that in deciding the decoding style for IBM2-HMM, the HMM method is better than the Joint method. We expected this type of performance since HMM decoding introduces positional dependence among the entire set of words in the sentence, which is known to be a good modeling assumption (Vogel et al., 1996).
From the results in Table 1 we see that the HMM outperforms all other models, including IBM2-HMM and its convex relaxation. However, IBM2-HMM is not far in AER performance from the HMM, and both it and its relaxation do better than FastAlign or IBM Model 3 (the results for IBM Model 3 are not presented; a one-directional English-French run of $1^52^53^{15}$ gave AER and F-Measure numbers of 0.1768 and 0.6588, respectively, and this was behind both the FastAlign IBM Model 2 and our models).
As a further set of experiments, we also appended an IBM Model 1 or IBM Model 2 objective to our models' original objectives, so that the constraints and parameters are the same but now we are maximizing the average of two log-likelihoods. With regard to the EM optimization, we would only need to add another $\delta$ parameter: we'd now have probabilities $\delta_1[i] \propto t(f^{(k)}_j|e^{(k)}_i)\, d(i|j, l_k, m_k)$ (this is for IBM Model 2 smoothing; we have $d = 1$ for IBM 1 smoothing) and $\delta_2[i, i'] \propto p(j, i, i', l_k, m_k)$ in the EM algorithm that results (for more, see (Simion et al., 2015a)). We note that the appended IBM Model 2 objective is still convex if we fix the distortions' $\lambda$ parameter and then optimize the $t$ parameters via EM (thus, model $2_{HC}$ remains convex). For us, there were significant gains, especially in the convex model. The results for all these experiments are shown in Table 2, with IBM 2 smoothing for the convex model displayed in the rightmost column.
Finally, we also tested our model in the full SMT pipeline using the cdec system (Dyer et al., 2013). For our experiments, we compared our models' alignments (gotten by training $1^52^52_H$ and $2^5_{HC}$) against the alignments gotten by the HMM ($1^52^5H^5$), IBM Model 4 ($1^5H^53^34^3$), and FastAlign. Unfortunately, we found that all 4 systems led to roughly the same BLEU score of 40 on a Spanish-English training set of 250,000 sentence pairs, which was a subset of version 7 of the Europarl dataset (Dyer et al., 2013). For our development and test sets, we used data sets each of size roughly 1800, and we preprocessed all data by considering only sentences of length less than 80 and filtering out sentences which had a very large (or small) ratio of target and source sentence lengths. Although the SMT results were not a success in that our gains were not significant, we felt that the experiments at least highlight that our model mimics the HMM's alignments even though its structure is much more local. Lastly, regarding the new convex model's performance, we observe much better alignment quality than any other convex alignment model in print, for example, (Simion et al., 2015a).

Table 2: Alignment quality results for IBM2-HMM and its relaxation using IBM 1 and IBM 2 smoothing (in this case, "smoothing" means adding these log-likelihoods to the original objective). For the convex relaxation of IBM2-HMM, we can only smooth by adding in the convex IBM Model 1 objective, or by adding in an IBM Model 2 objective where the distortions are taken to be constants (these distortions are identical to the ones used within the relaxation itself and are cross-validated for optimal $\lambda$).

Conclusions and Future Work
Our work has explored some of the details of a new model which combines the structure of IBM Model 2 and the HMM alignment model. We have shown that this new model and its convex relaxation perform very close to the standard GIZA++ implementation of the HMM. Bridging the gap between the HMM and convex models proves difficult for a number of reasons (Guo and Schuurmans, 2007). In this paper, we have introduced a new set of ideas aimed at tightening this gap.