Model Invertibility Regularization: Sequence Alignment With or Without Parallel Data

We present Model Invertibility Regularization (MIR), a method that jointly trains two directional sequence alignment models, one in each direction, and takes into account the invertibility of the alignment task. By coupling the two models through their parameters (as opposed to through their inferences, as in Liang et al.'s Alignment by Agreement (ABA) and Ganchev et al.'s Posterior Regularization (PostCAT)), our method seamlessly extends to all IBM-style word alignment models as well as to alignment without parallel data. Our proposed algorithm is mathematically sound and inherits convergence guarantees from EM. We evaluate MIR on two tasks: (1) On word alignment, applying MIR to fertility-based models, we attain higher F-scores than ABA and PostCAT. (2) On Japanese-to-English back-transliteration without parallel data, applied to the decipherment model of Ravi and Knight, MIR learns sparser models that close the gap in whole-name error rate by 33% relative to a model trained on parallel data and, further, beats a previous approach by Mylonakis et al.


Introduction
The transfer of information between languages is a common natural language phenomenon that is intuitively invertible. For example, in transliteration, a source-language word is mapped to a target language's writing system under a sound-preserving mapping (for example, "computer" to Japanese Romaji "konpyutaa"). The original word should then be recoverable from its transliterated version. Similarly, in translation, the back-translation of the translation of a word is likely to be that same word itself.
In NLP, however, commonly-used generative models describing such phenomena are directional, concerned only with the transfer of source-language symbols to target-language symbols or vice versa, but not both directions. Left unchecked, independently training two such directional models (source-to-target and target-to-source) often yields two models that diverge from this invertibility intuition.
In word alignment, this can lead to disagreements between alignments inferred by a model trained in one direction and those inferred by a model trained in the reverse direction. To remedy this disparity (and other shortcomings), it is common to turn to alignment symmetrization techniques such as grow-diag-final-and (Koehn et al., 2003), which heuristically combines alignments from both directions. Liang et al. (2006) suggest a more fundamental approach they call Alignment by Agreement (ABA), which jointly trains two word alignment models by maximizing their data-likelihoods along with a regularizer that rewards agreement between their alignment posteriors (computed over each parallel sentence pair). Although their EM-like optimization procedure is heuristic, it proves effective at jointly training bidirectional models. Ganchev et al. (2008) propose another approach for agreement between the directed models by adding constraints on the alignment posteriors. Unlike ABA, their optimization is exact, but it can be computationally expensive, requiring multiple forward-backward inferences in each E-step.
In this paper we develop a different approach for jointly training general bidirectional sequence alignment models, called Model Invertibility Regularization, or MIR (Section 3). Our approach has two key benefits over ABA and PostCAT: First, MIR can be applied to sequence alignment without parallel data. Second, a single implementation seamlessly extends to all IBM models, including the fertility-based models. Furthermore, since MIR follows the MAP-EM framework, it inherits its desirable convergence guarantees.
The key idea facilitating the easy extension to complex models and to non-parallel data settings is in our regularizer, which operates on the model parameters as opposed to their inferences. Specifically, MIR was designed to reward model pairs whose translation tables respect the invertibility intuition.
We tested MIR against competitive baselines on two sequence alignment tasks: word alignment (with parallel data) and back-transliteration decipherment (without parallel data).
On Czech-English and Chinese-English word alignment (Section 5), restricted to the HMM model, MIR attains F-score and Bleu improvements that are comparable to those of ABA and PostCAT. We further apply MIR beyond the HMM, to the fertility-based IBM models, showing further F-score gains compared to the baseline, ABA, and PostCAT. Interestingly, the HMM alignments obtained by ABA and MIR are qualitatively different, so that combining the two yields additive gains over each method by itself.
On English-Japanese back-transliteration decipherment (Section 6), we apply MIR to the cascade-of-wFSTs approach proposed by Ravi and Knight (2009). Using MIR, we are able to reduce the whole-name error-rate relative to a model trained on parallel data by 33%, as well as significantly outperform the joint model proposed by Mylonakis et al. (2007).

Background
We are concerned with learning generative models that describe transformations of a source-language sequence e = (e_1, . . ., e_I) to a target-language sequence f = (f_1, . . ., f_J). We consider two different data scenarios.
In the parallel data setting, each sample in the observed data consists of a pair (e, f). The generative story assigns the following probability to the event that f arises from e:

p(f | e; Θ) = Σ_a p(f, a | e; Θ)    (1)

where Θ denotes the model parameters and a denotes a hidden variable that corresponds to unknown choices taken in the generative process.
In the non-parallel data setting, only the target sequence f is observed and the source sequence e is hidden. The model assigns the following probability to the observed data:

p(f; Θ) = Σ_e p(e) p(f | e; Θ)    (2)

That is, the sequence f can arise from any sequence e by first selecting e ∼ p(e) and then proceeding according to the parallel-data generative story (Eq. 1).
Unsupervised training of such models entails maximizing the data log-likelihood

L(Θ) = Σ_n log p(x^n; Θ)

where X = {(e^n, f^n)}_n in the parallel data setting and X = {f^n}_n in the non-parallel data setting.
Although the structure of Θ is unspecified, in practice, most models that follow these generative stories contain a word translation table (t-table) denoted t, with each parameter t(f | e) representing the conditional probability of mapping a given source symbol e to a target symbol f.
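To make the data structure concrete, here is a minimal sketch of our own (not code from the paper) of a t-table held as a row-stochastic matrix; the vocabularies and probabilities are invented for illustration:

```python
import numpy as np

# Hypothetical vocabularies; row e of T is the distribution t(. | e).
V1 = ["dog", "cat"]            # source symbols e
V2 = ["inu", "neko", "NULL"]   # target symbols f
T = np.array([[0.7, 0.1, 0.2],
              [0.1, 0.8, 0.1]])

# Each row is a conditional distribution, so it must sum to 1.
assert np.allclose(T.sum(axis=1), 1.0)

def t(f, e):
    """Look up t(f | e) in the table."""
    return T[V1.index(e), V2.index(f)]

print(t("inu", "dog"))  # 0.7
```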

Model Invertibility Regularization
In this section we propose a method for jointly training two word alignment models, a source-to-target model Θ 1 and a target-to-source model Θ 2 , by regularizing their parameters to respect the invertibility of the alignment task. We therefore name our method Model Invertibility Regularization (MIR).

Regularizer
Our regularizer operates on the t-table parameters t1, t2 of the two models, as follows: Let matrices T1, T2 denote the t-tables t1, t2 in matrix form and consider their product T = T1 T2. The resulting matrix T is a stochastic square matrix of dimension |V1| × |V1|, where |V1| denotes the size of the source-language vocabulary. Each entry T_ij represents the total probability mass mapped from source word e_i to source word e_j by first applying the source-to-target mapping T1 and then the target-to-source mapping T2.
In particular, each diagonal entry T_ii holds the probability of mapping a source symbol back onto itself, a quantity we intuitively believe should be high. We therefore (initially) consider maximizing the trace of T:

Tr[T] = Σ_i T_ii

We further note that Tr[T] = Tr[T1 T2] = Tr[T2 T1], so the trace captures equally well how much the target symbols map onto themselves. Since T is stochastic, setting it to the identity matrix I maximizes its trace. In other words, the more T1 and T2 behave as (pseudo-)inverses of each other, the higher the trace. This exactly fits our intuition regarding invertibility.
Unfortunately, the trace is not concave in both T1 and T2, a property that will become desirable in optimization. We therefore modify the trace regularizer by applying the entrywise square root operator to T1, T2, and denote the new term R:

R(t1, t2) = Tr[√T1 √T2] = Σ_{e,f} √( t1(f | e) · t2(e | f) )    (3)

Note that R is maximized when √T1 √T2 = I. Concavity of R in both t1, t2 (or, equivalently, T1, T2) follows by observing that it is a sum of concave functions: each term in the summation is a geometric mean, which is concave in its parameters.
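As a concrete illustration (a numpy sketch of our own, not code from the paper), R can be computed directly from the two t-tables in matrix form; here T1 is a |V1| × |V2| row-stochastic matrix and T2 is |V2| × |V1|:

```python
import numpy as np

def mir_regularizer(T1, T2):
    """R(t1, t2) = Tr[sqrt(T1) sqrt(T2)] with entrywise square roots,
    i.e. the sum over (e, f) of sqrt(t1(f|e) * t2(e|f))."""
    return np.trace(np.sqrt(T1) @ np.sqrt(T2))

# A perfectly invertible model pair (a permutation and its inverse)
# attains the maximum value |V1|; pairing the permutation with itself
# (not its inverse) scores far lower.
P = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [1., 0., 0.]])
print(mir_regularizer(P, P.T))  # 3.0
print(mir_regularizer(P, P))    # 0.0
```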

Joint Objective Function
We apply MIR in two data scenarios: In the parallel data setting, we observe N sequence pairs {x^n_1}_n = {(e^n, f^n)}_n or, equivalently, {x^n_2}_n = {(f^n, e^n)}_n. In the non-parallel setting, two monolingual datasets are observed: N1 target sequences {x^n_1}_n = {f^n}_n and N2 source sequences {x^n_2}_n = {e^n}_n. The probability of the nth sample under the kth model Θk (for k ∈ {1, 2}) is denoted p_k(x^n_k; Θk). Specifically, in the parallel data setting, the probability of x^n_k under its model is:

p_1(x^n_1; Θ1) = p(f^n | e^n; Θ1)    p_2(x^n_2; Θ2) = p(e^n | f^n; Θ2)

whereas in the non-parallel data setting, the probability is defined as:

p_1(x^n_1; Θ1) = p(f^n; Θ1)    p_2(x^n_2; Θ2) = p(e^n; Θ2)
Using the above definitions and the MIR regularizer R (Eq. 3), we formulate an optimization program for maximizing the regularized log-likelihoods of the observed data:

max_{Θ1, Θ2}  Σ_{n=1}^{N1} log p_1(x^n_1; Θ1) + Σ_{n=1}^{N2} log p_2(x^n_2; Θ2) + λ R(t1, t2)    (4)

where λ ≥ 0 is a tunable hyperparameter (note that, in the parallel case, N = N1 = N2). We defer discussion of the relationship and merits of our approach with respect to ABA (Liang et al., 2006) and PostCAT (Ganchev et al., 2008) to Section 4.

Optimization Procedure
Using our concave regularizer, MIR optimization (Eq. 4) falls neatly under the MAP-EM framework (Dempster et al., 1977) and inherits the convergence properties of the underlying algorithms. MAP-EM follows the same structure as standard EM: the E-step remains identical to the standard E-step, while the M-step maximizes the complete-data log-likelihood plus the regularization term. In the case of MIR, the E-step can be carried out independently for each model. The only extra work is in the M-step, which optimizes a single (concave) objective function.
Specifically, let z^n denote the missing data, where, in the parallel data setting, only the alignment is missing (z^n_k = a^n_k), and in the non-parallel data setting, both the alignment and the hidden sequence are missing (z^n_1 = (a^n_1, e^n), z^n_2 = (a^n_2, f^n)). In the E-step, each model Θk (for k ∈ {1, 2}) is held fixed and its posterior distribution over the missing data z^n_k is computed for each observation x^n_k:

q_k(z^n_k) = p(z^n_k | x^n_k; Θk)

In the M-step, the computed posteriors are used to define a convex optimization program that maximizes the regularized sum of expected complete-data log-likelihoods:

max_{Θ1, Θ2}  Σ_k Σ_n E_{q_k}[ log p(x^n_k, z^n_k; Θk) ] + λ R(t1, t2)    (5)

where n ranges over the appropriate sample set.
Operationally, for models Θk that can be encoded as wFSTs (such as the IBM1, IBM2, and HMM word alignment models), the E-step can be carried out efficiently and exactly using dynamic programming (Eisner, 2002). Other models resort to approximation techniques; for example, the fertility-based word alignment models apply hill-climbing and sampling heuristics in order to efficiently estimate the posteriors (Brown et al., 1993).
From the computed posteriors q_k we collect expected counts for each event, which are used to construct the M-step optimization objective. Since the MIR regularizer couples only the t-table parameters, the update rule for any remaining parameter is left unchanged (that is, one can use the usual closed-form count-and-divide solution). Now, let C^{e,f}_1 and C^{e,f}_2 denote the expected counts for the t-table parameters. That is, C^{e,f}_k denotes the expected number of times a source-symbol type e is seen aligned to a target-symbol type f according to the posterior q_k. In the M-step, we maximize the following objective with respect to t1 and t2:

Σ_{e,f} C^{e,f}_1 log t1(f | e) + Σ_{e,f} C^{e,f}_2 log t2(e | f) + λ R(t1, t2)

which can be efficiently solved using convex programming techniques, due to the concavity of R and of the complete-data log-likelihoods in both t1 and t2.
In our implementation, we applied Projected Gradient Descent (Bertsekas, 1999; Schoenemann, 2011): at each step, the parameters are updated in the direction of the M-step objective gradient at (t1, t2) and then projected back onto the probability simplex. We used simple stopping conditions based on convergence of the objective value and a bounded number of iterations.
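A self-contained sketch of this M-step (our own illustration under simplifying assumptions: t-table parameters only, a fixed step size, toy expected counts, and a small probability floor to keep the log and square-root terms finite; the paper's actual implementation sits inside GIZA++):

```python
import numpy as np

def project_rows(M, floor=1e-6):
    """Euclidean projection of each row onto the probability simplex,
    followed by a small floor (a practical tweak, not from the paper)."""
    out = np.empty_like(M)
    for i, v in enumerate(M):
        u = np.sort(v)[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
        out[i] = np.maximum(v - (css[rho] - 1) / (rho + 1), 0.0)
    out = np.maximum(out, floor)
    return out / out.sum(axis=1, keepdims=True)

def mir_m_step(C1, C2, lam=0.5, lr=0.02, iters=100):
    """Maximize  sum C1*log T1 + sum C2*log T2 + lam*Tr[sqrt(T1) sqrt(T2)]
    by projected gradient ascent. C1: |V1|x|V2| counts; C2: |V2|x|V1|."""
    T1 = project_rows(C1 / C1.sum(axis=1, keepdims=True))
    T2 = project_rows(C2 / C2.sum(axis=1, keepdims=True))
    for _ in range(iters):
        # Gradient of the regularizer wrt T1[i,j] is 0.5*sqrt(T2[j,i]/T1[i,j]).
        g1 = C1 / T1 + lam * 0.5 * np.sqrt(T2.T / T1)
        g2 = C2 / T2 + lam * 0.5 * np.sqrt(T1.T / T2)
        T1 = project_rows(T1 + lr * g1)
        T2 = project_rows(T2 + lr * g2)
    return T1, T2
```

With diagonal-dominant counts in both directions, the learned tables stay row-stochastic and keep the diagonal mappings dominant, as the count-and-divide solution and the invertibility reward agree.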

Parallel Data Baseline: ABA and PostCAT
Our approach is most similar to Alignment by Agreement (Liang et al., 2006), which uses a single joint objective for two word alignment models. The difference between our objective (Eq. 4) and theirs lies in the regularizer: theirs rewards the per-sample agreement of the two models' alignment posteriors:

Σ_n log Σ_z p_1(z | x^n; Θ1) · p_2(z | x^n; Θ2)

where x^n = (e^n, f^n) and where z ranges over the possible alignments between e^n and f^n (in practice, only over 1-to-1 alignments, since each model is only capable of producing one-to-many alignments). Liang et al. (2006) note that proper EM optimization of their regularized joint objective leads to an intractable E-step. Unable to exactly and efficiently compute alignment posteriors, they resort to a product-of-marginals heuristic, which breaks EM's convergence guarantees but has a closed-form solution and works well in practice.
MIR regularization has both theoretical and practical advantages over ABA, which make our method more convenient and more broadly applicable:
1. By regularizing for posterior agreement, ABA is restricted to the parallel data setting, whereas MIR can be applied even without parallel data.
2. The posteriors of more advanced word alignment models (such as fertility-based models) do not correspond to alignments, and furthermore, are already estimated with approximation techniques. Thus, even if we somehow adapt ABA's product-of-marginals heuristic to such models, we run the risk of estimating highly inaccurate posteriors (specifically, zero-valued posteriors). In contrast, MIR extends to all IBM-style word alignment models and does not add heuristics. The M-step computation can be done exactly and efficiently with convex optimization.
3. MIR provides the same theoretical convergence guarantees as the underlying algorithms.
Ganchev et al. (2008) propose PostCAT, which uses Posterior Regularization (Ganchev et al., 2010) to enforce posterior agreement between the two models. Specifically, they add a KL-projection step after the E-step of the EM algorithm, which returns the posterior q(z | x) closest in KL-divergence to the E-step posterior while upholding certain constraints. The particular constraints they suggest encode alignment agreement in expectation between the two models' posteriors. For details, the reader can refer to Ganchev et al. (2008).
Similarly to ABA, with their suggested alignment agreement constraints, PostCAT cannot be applied without parallel data, and it is unclear how to extend it to fertility-based models (however, it does seem possible to apply other constraints using the general posterior regularization framework).
We compare MIR against ABA and PostCAT in Section 5.

Non-Parallel Data Baseline: bi-EM
Mylonakis et al. (2007) cast the two directional models as a single joint model via reparameterization and normalization. That is, both directional models, each consisting of a t-table only, are reparameterized in terms of a shared parameter set β:

t1(f | e) = β(e, f) / Σ_{f'} β(e, f')    t2(e | f) = β(e, f) / Σ_{e'} β(e', f)    (6)

They then maximize the likelihood of observed monolingual sequences from both languages:

max_β  Σ_n log p(f^n; β) + Σ_n log p(e^n; β)    (7)

where, for example, p(f^n; β) = Σ_e p(e) p(f^n | e; t1). Here, p(e) denotes the probability of e according to a fixed source language model. Once training of β is complete, we can decode an observed target sequence f by casting β back in terms of t1 and applying the Viterbi decoding algorithm.
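On our reading of this reparameterization (an assumption about Mylonakis et al.'s construction, not their code), casting the shared nonnegative parameter matrix β back into the two directional t-tables amounts to normalizing it over one vocabulary or the other:

```python
import numpy as np

def beta_to_t1(beta):
    """t1(f | e): normalize beta(e, f) over target symbols f (per row)."""
    return beta / beta.sum(axis=1, keepdims=True)

def beta_to_t2(beta):
    """t2(e | f): normalize beta(e, f) over source symbols e (per column),
    returned as a |V2| x |V1| row-stochastic matrix."""
    return (beta / beta.sum(axis=0, keepdims=True)).T
```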
To solve for β in Eq. 7, Mylonakis et al. (2007) propose bi-EM, an iterative EM-style algorithm. The objective function in their M-step is not concave, suggesting that a closed-form solution for the maximizer is unlikely. The probability estimate they use in the M-step appears to maximize an approximation of the M-step objective that omits the normalization factors in Eq. 7.
Nevertheless, bi-EM attains improved results compared to standard EM on both POS-tagging and monotone noun sequence translation without parallel data. We compare MIR against bi-EM in Sec. 6.

Experiments with Parallel Data
In this section, we compare MIR against standard EM training, ABA, and PostCAT on Czech-English and Chinese-English word alignment and translation.

Implementation and Code
For ABA and PostCAT training we used the authors' implementations, which support the HMM model.
Vanilla EM training was done using GIZA++, which supports all IBM models as well as the HMM. Our method, MIR, was implemented on top of GIZA++.

Data
We used the following parallel data to train the word alignment models:
Chinese-English: 287K sentence pairs from the NIST 2009 Open MT Evaluation constrained task, consisting of 5.3M and 6.6M tokens, respectively.
Czech-English: 85K sentence pairs from the News Commentary corpus, consisting of 1.6M and 1.8M tokens, respectively.
Sentence length was restricted to at most 40 tokens.

Word Alignment Experiments
We obtained HMM alignments by running either 5 or 10 iterations (optimized on a held-out validation set) of both IBM Model 1 and HMM. We obtained IBM Model 4 alignments by continuing with 5 iterations of IBM Model 3 and 10 iterations of IBM Model 4. We then extracted symmetrized alignments in the following manner: For all HMM models, we used the posterior decoding technique described in Liang et al. (2006) as implemented by each package. For IBM Model 4, we used the standard grow-diag-final-and (gdfa) symmetrization heuristic (Koehn et al., 2003). We tuned MIR's λ parameter to maximize alignment F-score on a validation set of 460 hand-aligned Czech-English and 1102 Chinese-English sentences.
Alignment F-scores are reported in Table 1. The best results were obtained by MIR when applied to the fertility-based IBM4 model: we obtained gains of +2.1% (Chinese-English) and +0.3% (Czech-English) over the best competitor.

MT Experiments
We ran MT experiments using the Moses (Koehn et al., 2007) phrase-based translation system. The feature weights were trained discriminatively using MIRA (Chiang et al., 2008), and we used a 5-gram language model trained on the Xinhua portion of English Gigaword (LDC2007T07). All other parameters remained at their default settings. The development data used for discriminative training were: for Chinese-English, data from the NIST 2004 and NIST 2006 test sets; for Czech-English, 2051 sentences from the WMT 2010 shared task. We used case-insensitive IBM Bleu (closest reference length) as our metric.
On both language pairs, ABA, PostCAT, and MIR outperform their respective EM baselines with comparable gains overall. However, we noticed that ABA and MIR do not produce the same alignments. For example, by combining their HMM alignments (simply concatenating the aligned bitexts), the total improvement reaches +1.5 Bleu on the Chinese-to-English task, a statistically significant improvement (p < 0.05) according to a bootstrap resampling significance test (Koehn, 2004).


Experiments without Parallel Data

Ravi and Knight (2009) consider the challenging task of learning a Japanese-English back-transliteration model without parallel data. The goal is to correctly decode a list of 100 US senator names written in katakana script, without having access to parallel data. In this section, we reproduce their decipherment experiment and show that applying MIR to their baseline model significantly outperforms both the baseline and the bi-EM method.

Phonetic-Based Japanese Decipherment
Ravi and Knight (2009) construct an English-to-Japanese transliteration model as a cascade of wFSTs (depicted in Figure 1, top). According to their generative story, any word in katakana is generated by re-writing an English word in its English phonetic representation, which is then transformed into a Japanese phonetic representation and finally re-written in katakana script. For example, the word "computer" is mapped to a sequence of 8 English phonemes (k, ah, m, p, y, uw, t, er), which is mapped to a sequence of 9 Japanese phonemes (K, O, N, P, Y, U, T, A, A) and finally to katakana. They apply their trained transliteration model to decode a list of 100 US senator names and report a whole-name error-rate (WNER) of 40% with parallel data (trained over 3.3K word pairs), compared to 73% WNER without parallel data (trained over 9.5K Japanese words only), demonstrating the weakness of methods that do not use parallel data.

Forward Pipeline
We reproduced the English-to-Japanese transliteration pipeline of Ravi and Knight (2009) by constructing each of the cascade wFSTs as follows:
1. A unigram language model (LM) of English terms, estimated over the top 40K most frequent capitalized words found in the Gigaword corpus (without smoothing).
2. An English pronunciation wFST built from the CMU pronunciation dictionary.
3. An English-to-Japanese phoneme mapping wFST that encodes a phoneme t-table t1, designed according to the best setting reported by Ravi and Knight (2009). Specifically, t1 is restricted to either 1-to-1 or 1-to-2 phoneme mappings and maintains consonant parity. See their paper for further details.
4. A hand-built Japanese-pronunciation-to-Katakana wFST (Ravi and Knight, 2009).

Backward Pipeline
MIR requires a pipeline in the reverse direction: transliteration of Japanese to English. We constructed a unigram LM of Katakana terms over the top 25K most frequent Katakana words found in the Japanese 2005-2008-news dictionary from the Leipzig corpora. The remaining required wFSTs were obtained by inverting the forward-model wFSTs (that is, wFSTs 2-4 above), and the cascade was composed in the reverse direction. In particular, by inverting t1, we obtained the Japanese-to-English t-table t2, which allows only 2-to-1 or 1-to-1 phoneme mappings.

Training Data
For training data, we used the top 50% most frequent terms from the monolingual data over which we constructed the LM wFSTs. This resulted in a set of 20K English terms (denoted ENG) and a set of 13K Japanese terms in Katakana (denoted KTKN).
Taking the entire set of monolingual terms led to poor baseline results, probably because uncommon English terms are not transliterated, and uncommon Katakana terms may be borrowed from languages other than English.
In any case, it is important to note that ENG and KTKN are unrelated, since both were collected over non-parallel corpora.

Training and Tuning
We train and tune four models:
baseline: the model proposed by Ravi and Knight (2009), which maximizes the likelihood (Eq. 2) of the observed Japanese terms KTKN.
Oracle: as an upper bound, we train the model of Ravi and Knight (2009) as if it were given the correct English origin of each Japanese term (over 4.2K parallel English-Japanese phoneme sequences).
We train each method for 15 EM iterations, while keeping the LM and pronunciation wFSTs fixed.
Training was done using the Carmel finite-state toolkit. Specifically, baseline and Oracle rely on Carmel exclusively, while for MIR and bi-EM, we manipulated Carmel to output the E-step posteriors, which we then used to construct and solve the M-step objective using our own implementation.
The different models were tuned on a development set consisting of 50 frequent Japanese terms and their English origins. For each method, we chose the so-called stretch factor α ∈ {1, 2, 3} used to exponentiate the model parameters before decoding (see Ravi and Knight (2009)), our model's hyperparameter λ ∈ {1, 2, 3, 4}, and the number of iterations (up to 15), so as to minimize WNER on the development set. We decoded Japanese terms using the Viterbi algorithm, applied to the selected t1 model (using Eq. 6 to convert the bi-EM model β back to t1). Finally, note that ABA training and the symmetrization decoding heuristics are inapplicable here, since they rely on parallel data.
(Carmel: http://www.isi.edu/licensed-sw/carmel/)

Figure 1: The transliteration generative story as a cascade of wFSTs. Each box represents a transducer. Top: transliteration of the word "computer" to Japanese Katakana. Bottom: the reverse process. MIR jointly trains the two cascades by maximizing the regularized data log-likelihood with respect to the two (shaded) phoneme mapping models t1, t2.
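The stretch-factor exponentiation mentioned above can be sketched as follows (our own illustration; whether Carmel renormalizes after exponentiation is an assumption we make explicit here by renormalizing rows):

```python
import numpy as np

def stretch(T, alpha):
    """Raise every t-table entry to the power alpha and renormalize rows.
    alpha > 1 sharpens each conditional distribution toward its mode,
    pushing Viterbi decoding to commit to the most confident mappings."""
    S = T ** alpha
    return S / S.sum(axis=1, keepdims=True)

T = np.array([[0.7, 0.3]])
print(stretch(T, 2))  # [[0.8448... 0.1551...]]
```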

Senator Name Decoding Results
We compiled our own test set, consisting of 100 US senator names (first and last), and compared the performance of the four algorithms. Table 3 reports WNER, average normalized edit distance (NED), and the number of model parameters (t1) with value greater than 0.01, as an indication of sparsity. Figure 2 further compares the 1-to-1 portions of the best model learned by the baseline method with the best model learned by MIR, showing the difference in parameter sparsity.

Table 3: MIR reduces error rates (WNER, NED) and learns sparser models (number of t1 parameters greater than 0.01) compared to the other models.
Using MIR, we obtained a significant reduction in error rates, closing the gap between the baseline method and Oracle (which was trained on parallel data) by 33% in WNER and nearly 50% in NED. This error reduction clearly demonstrates the efficacy of MIR in the non-parallel data setting.

Conclusion
We presented Model Invertibility Regularization (MIR), an unsupervised method for jointly training bidirectional sequence alignment models with or without parallel data. Our formulation is based on the simple observation that the alignment tasks at hand are inherently invertible and encourages the translation tables in both models to behave like pseudo-inverses of each other.
We derived an efficient MAP-EM algorithm and demonstrated our method's effectiveness on two different alignment tasks. On word alignment, applying MIR to the IBM4 model yielded the highest F-scores, and the resulting Bleu scores were comparable to those of Alignment by Agreement (Liang et al., 2006) and PostCAT (Ganchev et al., 2008). Our best MT results (up to a +1.5 Bleu improvement) were obtained by combining alignments from both MIR and ABA, indicating that the two methods learn complementary alignments. On Japanese-English back-transliteration with no parallel data, we obtained a significant error reduction over two baseline methods (Ravi and Knight, 2009; Mylonakis et al., 2007).
As future work, we plan to apply MIR to large-scale MT decipherment (Ravi and Knight, 2011; Dou and Knight, 2013), where, so far, only a single directional model has been used. Another promising direction is to encourage invertibility not only between words, but also between their senses and synonyms.