On A Strictly Convex IBM Model 1

IBM Model 1 is a classical alignment model. Of the first generation word-based SMT models, it was the only such model with a concave objective function. For concave optimization problems like IBM Model 1, we have guarantees on the convergence of optimization algorithms such as Expectation Maximization (EM). However, as was pointed out recently, the objective of IBM Model 1 is not strictly concave and there is quite a bit of alignment quality variance within the optimal solution set. In this work we detail a strictly concave version of IBM Model 1 whose EM algorithm is a simple modification of the original EM algorithm of Model 1 and does not require the tuning of a learning rate or the insertion of an $\ell_2$ penalty. Moreover, by addressing Model 1's shortcomings, we achieve AER and F-Measure improvements over the classical Model 1 by over 30%.


Introduction
The IBM translation models were introduced in (Brown et al., 1993) and were the first-generation Statistical Machine Translation (SMT) systems. In the current pipeline, these word-based models are the seeds for more sophisticated models which need alignment tableaus to start their optimization procedure. Among the original IBM Models, only IBM Model 1 can be formulated as a concave optimization problem. Recently, there has been some research on IBM Model 2 which addresses either the model's non-concavity (Simion et al., 2015) or its over-parametrization (Dyer et al., 2013).
* Currently on leave at Google Inc. New York.
We make the following contributions in this paper:
• We utilize and expand the mechanism introduced in (Simion et al., 2015) to construct strictly concave versions of IBM Model 1. As was shown in (Toutanova and Galley, 2011), IBM Model 1 is not a strictly concave optimization problem. What this means in practice is that although we can initialize the model with random parameters and reach the same objective value via the EM algorithm, there is quite a bit of alignment quality variance within the model's optimal solution set, and it remains ambiguous which optimal solution is truly the best. Typically, the easiest way to make a concave model strictly concave is to append an $\ell_2$ regularizer. However, this method does not allow for seamless EM training: we have to either use a learning-rate dependent gradient based algorithm directly or use a gradient method within the M step of EM training. In this paper we show how to obtain, via a simple technique, an infinite supply of strictly concave models that still allow a straightforward application of the EM algorithm.
• As a concrete application of the above, we detail a very simple strictly concave version of IBM Model 1 and study the performance of different members within this class. Our strictly concave models combine some of the elements of word association and positional dependence as in IBM Model 2 to yield a significant model improvement. Furthermore, we now have guarantees that the solution we find is unique.
• We detail an EM algorithm for a subclass of strictly concave IBM Model 1 variants. The EM algorithm is a small change to the original EM algorithm introduced in (Brown et al., 1993).
Notation. Throughout this paper, for any positive integer $N$, we use $[N]$ to denote $\{1, \ldots, N\}$ and $[N]_0$ to denote $\{0, \ldots, N\}$. We denote by $\mathbb{R}^n_+$ the set of nonnegative $n$-dimensional vectors. We denote by $[0, 1]^n$ the $n$-dimensional unit cube.

IBM Model 1
We begin by reviewing IBM Model 1 and introducing the necessary notation. To this end, throughout this section and the remainder of the paper we assume that our set of training examples is $(e^{(k)}, f^{(k)})$ for $k = 1 \ldots n$, where $e^{(k)}$ is the k'th English sentence and $f^{(k)}$ is the k'th French sentence. Following standard convention, we assume the task is to translate from French (the "source" language) into English (the "target" language). We use $E$ to denote the English vocabulary (set of possible English words), and $F$ to denote the French vocabulary. The k'th English sentence is a sequence of words $e^{(k)}_1 \ldots e^{(k)}_{l_k}$ where $l_k$ is the length of the k'th English sentence, and each $e^{(k)}_i \in E$; similarly the k'th French sentence is a sequence $f^{(k)}_1 \ldots f^{(k)}_{m_k}$ where $m_k$ is the length of the k'th French sentence, and each $f^{(k)}_j \in F$. We define $e^{(k)}_0$ for $k = 1 \ldots n$ to be a special NULL word (note that $E$ contains the NULL word).
For each English word e ∈ E, we will assume that D(e) is a dictionary specifying the set of possible French words that can be translations of e. The set D(e) is a subset of F . In practice, D(e) can be derived in various ways; in our experiments we simply define D(e) to include all French words f such that e and f are seen in a translation pair.
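Given these definitions, the IBM Model 1 optimization problem is given in Fig. 1 and, for example, in (Koehn, 2008). The parameters in this problem are the $t(f|e)$: translation parameters specifying the probability of English word $e$ being translated as French word $f$. The objective function is then the log-likelihood of the training data (see Eq. 3):
$$\frac{1}{n}\sum_{k=1}^{n}\sum_{j=1}^{m_k} \log p(f^{(k)}_j \mid e^{(k)}),$$
where
$$\log p(f^{(k)}_j \mid e^{(k)}) = \log \sum_{i=0}^{l_k} \frac{t(f^{(k)}_j \mid e^{(k)}_i)}{1 + l_k} = C + \log \sum_{i=0}^{l_k} t(f^{(k)}_j \mid e^{(k)}_i),$$
and $C$ is a constant that can be ignored.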
Input: Define $E$, $F$, $L$, $M$, $(e^{(k)}, f^{(k)}, l_k, m_k)$ for $k = 1 \ldots n$, $D(e)$ for $e \in E$ as in Section 2.

Parameters:
• A parameter $t(f|e)$ for each $e \in E$, $f \in D(e)$.

Constraints:
$$\forall e \in E,\ f \in D(e): \quad t(f|e) \geq 0 \qquad (1)$$
$$\forall e \in E: \quad \sum_{f \in D(e)} t(f|e) = 1 \qquad (2)$$

Objective: Maximize
$$\frac{1}{n}\sum_{k=1}^{n}\sum_{j=1}^{m_k} \log \sum_{i=0}^{l_k} t(f^{(k)}_j \mid e^{(k)}_i) \qquad (3)$$
with respect to the $t(f|e)$ parameters.

Figure 1: The IBM Model 1 optimization problem.

While IBM Model 1 is a concave optimization problem, it is not strictly concave (Toutanova and Galley, 2011). Therefore, optimization methods for IBM Model 1 (specifically, the EM algorithm) are typically guaranteed only to reach some global maximum of the objective function, not a unique one (see the Appendix for a simple example contrasting convex and strictly convex functions). In particular, although the objective value is the same for any optimal solution, the translation quality of the solutions is not fixed and will still depend on the initialization of the model (Toutanova and Galley, 2011).
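The lack of strict concavity can be seen on a toy problem. The following is a minimal sketch under simplifying assumptions, not part of the original experiments: a single sentence pair with two English words and two French words, no NULL word, and the constant C dropped. The names `model1_objective`, `p`, and `q` are hypothetical. With p = t(f1|e1) and q = t(f1|e2), the objective reduces to log(p + q) + log(2 − p − q), which is maximized by every point on the line p + q = 1.

```python
import math

def model1_objective(p, q):
    """Toy IBM Model 1 objective for one sentence pair (e1 e2, f1 f2).

    p = t(f1|e1) and q = t(f1|e2); the simplex constraints then force
    t(f2|e1) = 1 - p and t(f2|e2) = 1 - q. The NULL word and the
    constant C are omitted (an assumption made to keep the example small).
    """
    return math.log(p + q) + math.log((1 - p) + (1 - q))

# Three different parameter settings on the line p + q = 1.
solutions = [(1.0, 0.0), (0.5, 0.5), (0.2, 0.8)]
values = [model1_objective(p, q) for p, q in solutions]

# A point off the line p + q = 1 scores strictly lower.
off_line = model1_objective(0.9, 0.6)
```

All three settings reach the same objective value yet imply different alignments: (1, 0) aligns f1 to e1 with certainty, while (0.5, 0.5) is indifferent between e1 and e2.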

A Strictly Concave IBM Model 1
We now detail a very simple method to make IBM Model 1 strictly concave with a unique optimal solution without the need for appending an $\ell_2$ loss.
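Theorem 1. Consider IBM Model 1 and modify its objective to be
$$\frac{1}{n}\sum_{k=1}^{n}\sum_{j=1}^{m_k} \log \sum_{i=0}^{l_k} h_{i,j,k}\big(t(f^{(k)}_j \mid e^{(k)}_i)\big) \qquad (4)$$
where each $h_{i,j,k} : \mathbb{R}_+ \to \mathbb{R}_+$ is strictly concave. With the new objective and the same constraints as IBM Model 1, this new optimization problem is strictly concave.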
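Proof. We show that the new objective, which we write as $L(t)$, is strictly concave (concavity follows in the same way). Suppose by way of contradiction that there are $(t) \neq (t')$ and $\theta \in (0, 1)$ such that equality holds in Jensen's inequality, that is, $L(\theta t + (1-\theta) t') = \theta L(t) + (1-\theta) L(t')$. Since $(t) \neq (t')$, there is some $(k, j, i)$ such that $t(f^{(k)}_j \mid e^{(k)}_i) \neq t'(f^{(k)}_j \mid e^{(k)}_i)$, and since $h_{i,j,k}$ is strictly concave we have
$$h_{i,j,k}\big(\theta t(f^{(k)}_j \mid e^{(k)}_i) + (1-\theta) t'(f^{(k)}_j \mid e^{(k)}_i)\big) > \theta h_{i,j,k}\big(t(f^{(k)}_j \mid e^{(k)}_i)\big) + (1-\theta) h_{i,j,k}\big(t'(f^{(k)}_j \mid e^{(k)}_i)\big).$$
Using Jensen's inequality, the monotonicity of the log, and the above strict inequality we have
$$L(\theta t + (1-\theta) t') = \frac{1}{n}\sum_{k}\sum_{j} \log \sum_{i} h_{i,j,k}\big(\theta t + (1-\theta) t'\big) > \frac{1}{n}\sum_{k}\sum_{j} \log \sum_{i} \big[\theta h_{i,j,k}(t) + (1-\theta) h_{i,j,k}(t')\big] \geq \theta L(t) + (1-\theta) L(t'),$$
where $t$ and $t'$ inside the sums abbreviate $t(f^{(k)}_j \mid e^{(k)}_i)$ and $t'(f^{(k)}_j \mid e^{(k)}_i)$. This contradicts the assumed equality.

The IBM Model 1 strictly concave optimization problem is presented in Fig. 2. In (7) it is crucial that each $h_{i,j,k}$ be strictly concave within the inner sum $\sum_{i=0}^{l_k} h_{i,j,k}(t(f^{(k)}_j \mid e^{(k)}_i))$. For example, $\sqrt{x_1 + x_2}$ is concave but not strictly concave, and the proof of Theorem 1 would break down. To see this, consider two distinct points with the same coordinate sum, such as $(x_1, x_2) \neq (x_2, x_1)$, and note that equality holds in Jensen's inequality. We should be clear: the main reason why Theorem 1 works is that the $h_{i,j,k}$ are strictly concave (on $\mathbb{R}_+$) and all the lexical probabilities that are arguments to $L$ are present within the log-likelihood.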
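The distinction between applying a concave function to a sum and applying a strictly concave function to each coordinate can be checked numerically. The sketch below is illustrative only; the function names `joint`, `separable`, and `jensen_gap` are hypothetical, and the two test points are chosen to have the same coordinate sum.

```python
import math

def joint(x1, x2):
    # sqrt applied to the sum: concave, but constant along x1 + x2 = const.
    return math.sqrt(x1 + x2)

def separable(x1, x2):
    # A strictly concave function applied to each coordinate separately.
    return math.sqrt(x1) + math.sqrt(x2)

def jensen_gap(func, a, b, theta=0.5):
    """Return f(theta*a + (1-theta)*b) - [theta*f(a) + (1-theta)*f(b)].

    The gap is nonnegative for concave f and strictly positive for
    strictly concave f whenever a != b.
    """
    mid = tuple(theta * u + (1 - theta) * v for u, v in zip(a, b))
    return func(*mid) - (theta * func(*a) + (1 - theta) * func(*b))

# Two distinct points with the same coordinate sum.
a, b = (0.25, 0.75), (0.75, 0.25)
gap_joint = jensen_gap(joint, a, b)          # Jensen holds with equality
gap_separable = jensen_gap(separable, a, b)  # Jensen is strict
```

The zero gap for `joint` is the failure mode described above, while the positive gap for `separable` mirrors the structure of the objective in Theorem 1, where each lexical probability passes through its own strictly concave function.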
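Input: Define $E$, $F$, $L$, $M$, $(e^{(k)}, f^{(k)}, l_k, m_k)$ for $k = 1 \ldots n$, $D(e)$ for $e \in E$ as in Section 2. A set of strictly concave functions $h_{i,j,k} : \mathbb{R}_+ \to \mathbb{R}_+$.

Parameters:
• A parameter $t(f|e)$ for each $e \in E$, $f \in D(e)$.

Constraints:
$$\forall e \in E,\ f \in D(e): \quad t(f|e) \geq 0 \qquad (5)$$
$$\forall e \in E: \quad \sum_{f \in D(e)} t(f|e) = 1 \qquad (6)$$

Objective: Maximize
$$\frac{1}{n}\sum_{k=1}^{n}\sum_{j=1}^{m_k} \log \sum_{i=0}^{l_k} h_{i,j,k}\big(t(f^{(k)}_j \mid e^{(k)}_i)\big) \qquad (7)$$
with respect to the $t(f|e)$ parameters.

Figure 2: The IBM Model 1 strictly concave optimization problem.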
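To illustrate the uniqueness claim, the following sketch evaluates the modified objective on a toy problem. It is an illustration under stated assumptions rather than the setup used in the experiments: one sentence pair with two English and two French words, no NULL word, and h(x) = x^β with β = 0.5 for every (i, j, k). The name `strictly_concave_objective` and the grid search are hypothetical conveniences. For comparison, with the original Model 1 objective on this toy problem, which is log(p + q) + log(2 − p − q), the points (1, 0) and (0.5, 0.5) tie.

```python
import math

def strictly_concave_objective(p, q, beta=0.5):
    """Toy modified objective with h(x) = x**beta for one sentence pair.

    p = t(f1|e1) and q = t(f1|e2), so t(f2|e1) = 1 - p and
    t(f2|e2) = 1 - q by the simplex constraints.
    """
    first = p ** beta + q ** beta
    second = (1 - p) ** beta + (1 - q) ** beta
    return math.log(first) + math.log(second)

# Grid search over the interior of the parameter square.
grid = [i / 100 for i in range(1, 100)]
best_value, best_p, best_q = max(
    (strictly_concave_objective(p, q), p, q) for p in grid for q in grid
)

# A corner point that is optimal for the original objective on this toy problem.
corner_value = strictly_concave_objective(1.0, 0.0)
```

On this grid the maximum is attained only at p = q = 0.5, and the corner (1, 0) now scores strictly lower, so the tie between optimal solutions is broken.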

Parameter Estimation via EM
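For the IBM Model 1 strictly concave optimization problem, we can derive a clean EM algorithm if we base our relaxation on $h_{i,j,k}(t(f^{(k)}_j \mid e^{(k)}_i)) = \alpha(e^{(k)}_i, f^{(k)}_j)\, t(f^{(k)}_j \mid e^{(k)}_i)^{\beta(e^{(k)}_i, f^{(k)}_j)}$ with $\beta(e^{(k)}_i, f^{(k)}_j) < 1$. To justify this, we first need the following:
Lemma 1. Consider $h : \mathbb{R}_+ \to \mathbb{R}_+$ given by $h(x) = x^{\beta}$ where $\beta \in (0, 1)$. Then $h$ is strictly concave.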
Proof. The proof of this lemma is elementary and follows since the second derivative, given by $h''(x) = \beta(\beta - 1)x^{\beta - 2}$, is strictly negative for $x > 0$.
For our concrete experiments, we picked a model based on Lemma 1 and used $h(x) = \alpha x^{\beta}$ with $\alpha, \beta \in (0, 1)$ so that
$$h_{i,j,k}\big(t(f^{(k)}_j \mid e^{(k)}_i)\big) = \alpha(e^{(k)}_i, f^{(k)}_j)\, t(f^{(k)}_j \mid e^{(k)}_i)^{\beta(e^{(k)}_i, f^{(k)}_j)}.$$
Using this setup, parameter estimation for the new model can be accomplished via a slight modification of the EM algorithm for IBM Model 1. In particular, the posterior probabilities of this model factor just as those of the standard Model 1, and we have an M step that requires optimizing
$$\sum_{a^{(k)}} q(a^{(k)} \mid e^{(k)}, f^{(k)}) \log p(f^{(k)}, a^{(k)} \mid e^{(k)}),$$
where the $q(a^{(k)} \mid e^{(k)}, f^{(k)})$ are constants obtained in the E step. This optimization step is very similar to the regular Model 1 M step, since the $\beta$ drops down using $\log t^{\beta} = \beta \log t$; the exact same count-based method can be applied. The details of this algorithm are in Fig. 3.

[Figure 3: The EM algorithm for the strictly concave IBM Model 1. Input: Define $E$, $F$, $L$, $M$, $(e^{(k)}, f^{(k)}, l_k, m_k)$ for $k = 1 \ldots n$, $D(e)$ for $e \in E$ as in Section 2. An integer $T$ specifying the number of passes over the data. A set of weighting parameters $\alpha(e, f), \beta(e, f) \in (0, 1)$ for each $e \in E$, $f \in D(e)$. A tuning parameter $\lambda > 0$.]
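The count-based EM procedure described above can be sketched as follows. This is a minimal illustration derived from the description in the text (posteriors proportional to αt^β, counts weighted by β), not a reproduction of Fig. 3. The function names, the uniform initialization, the callables `alpha` and `beta`, and the expectation that callers include a "NULL" token in each English sentence are all assumptions of this sketch.

```python
import math
from collections import defaultdict

def objective(corpus, t, alpha, beta):
    """Modified log-likelihood: mean over sentence pairs of
    sum_j log sum_i alpha(e_i, f_j) * t(f_j|e_i) ** beta(e_i, f_j)."""
    total = 0.0
    for e_sent, f_sent in corpus:
        for f in f_sent:
            total += math.log(
                sum(alpha(e, f) * t[e][f] ** beta(e, f) for e in e_sent)
            )
    return total / len(corpus)

def em_strictly_concave_model1(corpus, alpha, beta, iterations=10):
    # D(e): all French words seen in a sentence pair together with e.
    dictionary = defaultdict(set)
    for e_sent, f_sent in corpus:
        for e in e_sent:
            dictionary[e].update(f_sent)
    # Uniform initialization over each dictionary (an assumption).
    t = {e: {f: 1.0 / len(fs) for f in fs} for e, fs in dictionary.items()}
    history = []
    for _ in range(iterations):
        counts = {e: defaultdict(float) for e in dictionary}
        for e_sent, f_sent in corpus:
            for f in f_sent:
                # E step: posterior over English positions, proportional to alpha * t**beta.
                scores = [alpha(e, f) * t[e][f] ** beta(e, f) for e in e_sent]
                z = sum(scores)
                for e, s in zip(e_sent, scores):
                    # The exponent drops down as a weight on the expected count.
                    counts[e][f] += beta(e, f) * s / z
        # M step: renormalize the weighted counts for each English word.
        for e in dictionary:
            norm = sum(counts[e].values())
            t[e] = {f: counts[e][f] / norm for f in dictionary[e]}
        history.append(objective(corpus, t, alpha, beta))
    return t, history

toy_corpus = [
    (["NULL", "the", "house"], ["la", "maison"]),
    (["NULL", "the", "book"], ["le", "livre"]),
    (["NULL", "a", "house"], ["une", "maison"]),
]
t_hat, objective_history = em_strictly_concave_model1(
    toy_corpus, alpha=lambda e, f: 1.0, beta=lambda e, f: 0.5, iterations=15
)
```

Setting β to 1 everywhere recovers the usual Model 1 counts, which is why no learning rate is needed in either case.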

Choosing α and β
The performance of our new model will rely heavily on the choice of $\alpha(e^{(k)}_i, f^{(k)}_j), \beta(e^{(k)}_i, f^{(k)}_j) \in (0, 1)$ we use. In particular, we could make $\beta$ depend on the association between the words, or the words' positions, or both. One classical measure of word association is the dice coefficient (Och and Ney, 2003) given by
$$\mathrm{dice}(e, f) = \frac{2c(e, f)}{c(e) + c(f)}.$$
In the above, the count terms $c$ are the number of training sentences that contain a particular word or a pair of words $(e, f)$. As with the other choices we explore, the dice coefficient is a fraction between 0 and 1, with 0 and 1 implying less and more association, respectively. Additionally, we make use of positional constants $d(i|j, l, m)$ like those of the IBM Model 2 distortions, normalized by the partition function $Z(j, l, m)$ discussed in (Dyer et al., 2013). The previous measures all lead to potential candidates for $\beta(e, f)$: we have $t(f|e) \in (0, 1)$, and we want to enlarge competing values when decoding (we use $\alpha t^{\beta}$ instead of $t$ when getting the Viterbi alignment). Since $t^{\beta}$ grows as $\beta$ shrinks for $t \in (0, 1)$, the word association measures should be inversely related to $\beta$, and so we set $\beta(e, f) = 1 - \mathrm{dice}(e, f)$ or $\beta(e, f) = 1 - d(i|j, l, m)$. In our experiments we picked $\alpha(e^{(k)}_i, f^{(k)}_j) = d(i|j, l_k, m_k)$ or 1; we hold $\lambda$ to a constant of either 16 or 0 and do not estimate this variable ($\lambda = 16$ can be chosen by cross validation on a small trial data set).
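The two association measures can be sketched in code as follows. The dice computation follows the sentence-level counts described above. The positional term is an assumption: it uses the exponential form exp(−λ|i/l − j/m|) from (Dyer et al., 2013), normalized over English positions 1..l, and it leaves out any special treatment of the NULL position. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def dice_coefficients(corpus):
    """dice(e, f) = 2 c(e, f) / (c(e) + c(f)), where c counts the training
    sentence pairs containing a word or a word pair."""
    c_e, c_f, c_ef = Counter(), Counter(), Counter()
    for e_sent, f_sent in corpus:
        e_set, f_set = set(e_sent), set(f_sent)
        c_e.update(e_set)
        c_f.update(f_set)
        c_ef.update(product(e_set, f_set))
    return {(e, f): 2 * n / (c_e[e] + c_f[f]) for (e, f), n in c_ef.items()}

def distortion(i, j, l, m, lam=16.0):
    """Assumed positional constant d(i|j, l, m) for English position i in 1..l
    and French position j in 1..m, normalized by Z(j, l, m)."""
    def weight(pos):
        return math.exp(-lam * abs(pos / l - j / m))
    z = sum(weight(pos) for pos in range(1, l + 1))  # partition function Z(j, l, m)
    return weight(i) / z

def beta_from_dice(dice, e, f):
    # Stronger association gives a smaller exponent, which enlarges t**beta.
    return 1.0 - dice[(e, f)]

toy_corpus = [
    (["the", "house"], ["la", "maison"]),
    (["the", "book"], ["le", "livre"]),
]
dice = dice_coefficients(toy_corpus)
```

With λ = 0 the assumed positional term is uniform over positions. Note that a pair that always co-occurs has dice = 1 and hence β = 0, which lies on the boundary of (0, 1); this sketch does not attempt to handle that case.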

Data Sets
For our alignment experiments, we used a subset of the Canadian Hansards bilingual corpus with 247,878 English-French sentence pairs as training data, 37 sentences of development data, and 447 sentences of test data (Mihalcea and Pedersen, 2003). As a second validation corpus, we considered a training set of 48,706 Romanian-English sentence pairs, a development set of 17 sentence pairs, and a test set of 248 sentence pairs (Mihalcea and Pedersen, 2003).

Methodology
Below we report results in both AER (lower is better) and F-Measure (higher is better) (Och and Ney, 2003) for the English → French translation direction. To declare a better model we have to settle on an alignment measure. Although the relationship between AER/F-Measure and translation quality varies (Dyer et al., 2013), there are some positive experiments (Fraser and Marcu, 2004) showing that F-Measure may be more useful, so perhaps a comparison based on F-Measure is ideal. Table 1 contains our results for the Hansards data. For the smaller Romanian data, we obtained similar behavior, but we leave out these results.

[Table 1: Results on the Hansards data for each (α, β) setting; column 1 is (1, 1), column 2 is (d, 1), and column 5 is (d, 1 − d).]

In Table 1, the strictly concave model yields the best F-Measure performance and is not far off in AER from the "fake" IBM Model 2 (see footnote 2), obtained by setting (α, β) = (d, 1), whose results are in column 2. We include this model because it should be better than IBM Model 1, and we want to know how far off we are from this obvious improvement. Moreover, we note that dice does not lead to quality β exponents and that, unfortunately, combining methods as in column 5 ((α, β) = (d, 1 − d)) does not necessarily lead to additive gains in AER and F-Measure performance.
2 Generally speaking, when using (α, β) = (d, 1) with d constant we cannot use Theorem 1 since h is linear. Most likely, the strict concavity of the model will hold because of the asymmetry introduced by the d term; however, this will necessarily depend on the data set.
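For reference, the evaluation measures used in this section can be computed from sets of alignment links as in the sketch below. Precision, recall, and AER follow the standard definitions with sure links S and possible links P (Och and Ney, 2003). The balanced form of the F-Measure, the function name `alignment_scores`, and the representation of links as (English position, French position) pairs are assumptions of this sketch, which also assumes non-empty predicted and sure sets.

```python
def alignment_scores(predicted, sure, possible):
    """Return (precision, recall, F-Measure, AER) for one set of alignment links."""
    a = set(predicted)
    s = set(sure)
    p = set(possible) | s  # every sure link is also a possible link
    precision = len(a & p) / len(a)
    recall = len(a & s) / len(s)
    if precision + recall > 0:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    aer = 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
    return precision, recall, f_measure, aer

# One of two predicted links matches the gold standard.
scores = alignment_scores(
    predicted=[(1, 1), (2, 3)], sure=[(1, 1), (2, 2)], possible=[]
)
```

A lower AER and a higher F-Measure both indicate closer agreement with the gold alignment, which matches the reading of the results above.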

Comparison with Previous Work
In this section we take a moment to also compare our work with the classical IBM 1 work of (Moore, 2004). Summarizing (Moore, 2004), we note that this work improves substantially upon the classical IBM Model 1 by introducing a set of heuristics, among which are to (1) modify the lexical parameter dictionaries, (2) introduce an initialization heuristic, (3) modify the standard IBM 1 EM algorithm by introducing smoothing, and (4) tune additional parameters. However, we stress that the main concern of this work is not just heuristic-based empirical improvement, but also structured learning. In particular, although using an $\ell_2$ regularizer and the methods of (Moore, 2004) would yield a strictly concave version of IBM 1 as well (with improvements), it is not at all obvious how to choose the learning rate or set the penalty on the lexical parameters. The goal of our work was to offer a new, alternate form of regularization. Moreover, since we are changing the original log-likelihood, our method can be thought of as a way of bringing the $\ell_2$ regularizer inside the log-likelihood. Like (Moore, 2004), we also achieve appreciable gains but have just one tuning parameter (when β = 1 − d we just have the centering λ parameter) and do not break the probabilistic interpretation any more than appending a regularizer would (our method modifies the log-likelihood but the simplex constraints remain).

Conclusion
In this paper we showed how IBM Model 1 can be made into a strictly convex optimization problem via functional composition. We looked at a specific member within the studied optimization family that allows for an easy EM algorithm. Finally, we conducted experiments showing how the model performs on some standard data sets and empirically showed a 30% improvement over the standard IBM Model 1 algorithm. For further research, we note that picking the optimal $h_{i,j,k}$ is an open question, so provably finding and justifying the choice is one topic of interest.