Multi-Task Word Alignment Triangulation for Low-Resource Languages

We present a multi-task learning approach that jointly trains three word alignment models over disjoint bitexts of three languages: source, target and pivot. Our approach builds upon model triangulation, following Wang et al., which approximates a source-target model by combining source-pivot and pivot-target models. We develop a MAP-EM algorithm that uses triangulation as a prior, and show how to extend it to a multi-task setting. On a low-resource Czech-English corpus, using French as the pivot, our multi-task learning approach more than doubles the gains in both F- and Bleu scores compared to the interpolation approach of Wang et al. Further experiments reveal that the choice of pivot language does not significantly affect performance.


Introduction
Word alignment (Brown et al., 1993; Vogel et al., 1996) is a fundamental task in the machine translation (MT) pipeline. To train good word alignment models, we require access to a large parallel corpus. However, collection of parallel corpora has mostly focused on a small number of widely-spoken languages. As such, resources for almost any other pair are either limited or non-existent.
To improve word alignment and MT in a low-resource setting, we design a multi-task learning approach that utilizes parallel data of a third language, called the pivot language (§3). Specifically, we derive an efficient and easy-to-implement MAP-EM-like algorithm that jointly trains source-target, source-pivot and pivot-target alignment models, each on its own bitext, such that each model benefits from observations made by the other two.
Our method subsumes the model interpolation approach of Wang et al. (2006), who independently train these three models and then interpolate the source-target model with an approximate source-target model, constructed by combining the source-pivot and pivot-target models.
Pretending that Czech-English is low-resource, we conduct word alignment and MT experiments (§4). With French as the pivot, our approach significantly outperforms the interpolation method of Wang et al. (2006) on both alignment F- and Bleu scores. Somewhat surprisingly, we find that our approach is insensitive to the choice of pivot language.

Triangulation and Interpolation
Wang et al. (2006) focus on learning a word alignment model without a source-target corpus. To do so, they assume access to both source-pivot and pivot-target bitexts, on which they independently train a source-pivot word alignment model $\Theta_{sp}$ and a pivot-target model $\Theta_{pt}$. They then combine the two models by marginalizing over the pivot language, resulting in an approximate source-target model $\hat{\Theta}_{st}$. This combination process is referred to as triangulation (see §5).
In particular, they construct the triangulated source-target t-table $\hat{t}_{st}$ from the source-pivot and pivot-target t-tables $t_{sp}$, $t_{pt}$ using the following approximation:
$$\hat{t}_{st}(t \mid s) = \sum_{p} t_{pt}(t \mid p) \cdot t_{sp}(p \mid s) \tag{1}$$
Subsequently, if a source-target corpus is available, they train a standard source-target model $\Theta_{st}$, and tune the interpolation
$$t_{\mathrm{interp}}(t \mid s) = \lambda_{\mathrm{interp}} \, t_{st}(t \mid s) + (1 - \lambda_{\mathrm{interp}}) \, \hat{t}_{st}(t \mid s)$$
with respect to $\lambda_{\mathrm{interp}}$ to reduce alignment error rate (Koehn, 2005) over a hand-aligned development set. Wang et al. (2006) propose triangulation heuristics for other model parameters; however, in this paper, we consider only t-table triangulation.
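The triangulation of Eq. 1 amounts to a matrix product when t-tables are stored as row-stochastic matrices. The following is a minimal sketch under that assumption; the dense NumPy layout, the function names and the toy numbers are illustrative and not taken from the paper, and the interpolation is assumed to be a convex combination weighted by `lam_interp`.

```python
import numpy as np

def triangulate(t_sp, t_pt):
    """Eq. 1: t_hat_st(t|s) = sum_p t_pt(t|p) * t_sp(p|s).

    Assumed layout: t_sp[s, p] = t_sp(p|s) and t_pt[p, t] = t_pt(t|p),
    so every row is a probability distribution.
    """
    return t_sp @ t_pt

def interpolate(t_st, t_hat_st, lam_interp):
    """Assumed convex combination of the trained and triangulated t-tables."""
    return lam_interp * t_st + (1.0 - lam_interp) * t_hat_st

# Toy example: 2 source words, 2 pivot words, 3 target words.
t_sp = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
t_pt = np.array([[0.7, 0.3, 0.0],
                 [0.0, 0.5, 0.5]])
t_hat_st = triangulate(t_sp, t_pt)

# A (hypothetical) t-table trained directly on a small source-target bitext.
t_st = np.array([[1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0]])
t_interp = interpolate(t_st, t_hat_st, lam_interp=0.5)
```

Because both inputs are row-stochastic, the product is row-stochastic as well, so no renormalization is needed after triangulation or interpolation.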

Our Method
We now discuss two approaches that better exploit model triangulation. In the first, we use the triangulated t-table to construct a prior on the source-target t-table. In the second, we place a prior on each of the three models and train them jointly.

Triangulation as a Fixed Prior
We first propose to better utilize the triangulated t-table $\hat{t}_{st}$ (Eq. 1) by using it to construct an informative prior for the source-target t-table $t_{st} \in \Theta_{st}$.
Specifically, we modify the word alignment generative story by placing Dirichlet priors on each of the multinomial t-table distributions $t_{st}(\cdot \mid s)$:
$$t_{st}(\cdot \mid s) \sim \mathrm{Dirichlet}(\alpha_s) \tag{2}$$
Here, each $\alpha_s = (\ldots, \alpha_{st}, \ldots)$ denotes a hyperparameter vector which will be defined shortly.
Fixing this prior, we optimize the model posterior likelihood $P(\Theta_{st} \mid \mathrm{bitext}_{st})$ to find a maximum a posteriori (MAP) estimate. This is done according to the MAP-EM framework (Dempster et al., 1977), which differs slightly from standard EM. The E-step remains as is: fixing the model $\Theta_{st}$, we collect expected counts $E[c(s, t)]$ for each decision in the generative story. The M-step is modified to maximize the regularized expected complete-data log-likelihood with respect to the model parameters $\Theta_{st}$, where the regularizer corresponds to the prior.
Due to the conjugacy of the Dirichlet priors with the multinomial t-table distributions, the sole modification to the regular EM implementation is in the M-step update rule of the t-table parameters:
$$t_{st}(t \mid s) = \frac{E[c(s, t)] + \alpha_{st} - 1}{\sum_{t'} \left( E[c(s, t')] + \alpha_{st'} - 1 \right)} \tag{3}$$
where $E[c(s, t)]$ is the expected number of times source word $s$ aligns with target word $t$ in the source-target bitext. Moreover, through Eq. 3, we can view $\alpha_{st} - 1$ as a pseudo-count for such an alignment.
To define the hyperparameter vector $\alpha_s$ we decompose it as follows:
$$\alpha_s = C_s \cdot m_s + \mathbf{1} \tag{4}$$
where $C_s > 0$ is a scalar parameter, $m_s$ is a probability vector encoding the mode of the Dirichlet, and $\mathbf{1}$ denotes an all-one vector. Roughly, when $C_s$ is high, samples drawn from the Dirichlet are likely to concentrate near the mode $m_s$. Using this decomposition, we set for all $s$:
$$m_s = \hat{t}_{st}(\cdot \mid s) \tag{5}$$
$$C_s = \lambda \cdot c(s)^{\gamma} \cdot \frac{\sum_{s'} c(s')}{\sum_{s'} c(s')^{\gamma}} \tag{6}$$
where $c(s)$ is the count of source word $s$ in the source-target bitext, and the scalar hyperparameters $\lambda, \gamma > 0$ are to be tuned. (We experimented with completely eliminating the hyperparameters $\gamma, \lambda$ by directly learning the parameters $C_s$. To do so, we implemented the algorithm of Minka (2000) for learning the Dirichlet prior, but only learned the parameters $C_s$ while keeping the means $m_s$ fixed to the triangulation. However, preliminary experiments showed performance degradation compared to simple hyperparameter tuning.)
Thus, the distribution $t_{st}(\cdot \mid s)$ arises from a Dirichlet with mode $\hat{t}_{st}(\cdot \mid s)$ and will tend to concentrate around this mode as a function of the frequency of $s$. The hyperparameter $\lambda$ linearly controls the strength of all priors. The last term in Eq. 6 keeps the sum of $C_s$ insensitive to $\gamma$, such that $\sum_s C_s = \lambda \sum_s c(s)$. In all our experiments we fixed $\gamma = 0.5$. Setting $\gamma < 1$ down-weights the parameter $C_s$ of frequent words $s$ compared to rare ones. This makes the Dirichlet prior relatively weaker for frequent words, where we can let the data speak for itself, and relatively stronger for rare ones, where a good prior is needed.
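The hyperparameter setting and the modified M-step can be sketched in a few lines. This is only an illustration under stated assumptions: t-tables and expected counts are dense NumPy arrays indexed `[s, t]`, the function names `concentration` and `map_m_step` are hypothetical, and the toy counts are made up.

```python
import numpy as np

def concentration(counts, lam, gamma=0.5):
    """Eq. 6: C_s = lam * c(s)^gamma * sum_s' c(s') / sum_s' c(s')^gamma."""
    counts = np.asarray(counts, dtype=float)
    powered = counts ** gamma
    return lam * powered * counts.sum() / powered.sum()

def map_m_step(expected_counts, t_hat_st, C):
    """Eq. 3 with alpha_s = C_s * m_s + 1 and m_s = t_hat_st(.|s).

    The pseudo-count alpha_st - 1 is therefore C_s * t_hat_st(t|s).
    """
    pseudo = C[:, None] * t_hat_st
    numer = expected_counts + pseudo
    return numer / numer.sum(axis=1, keepdims=True)

# Toy example: a frequent source word (count 100) and a rare one (count 4).
c_s = np.array([100.0, 4.0])
C = concentration(c_s, lam=0.5, gamma=0.5)

t_hat_st = np.array([[0.63, 0.32, 0.05],
                     [0.14, 0.46, 0.40]])
expected = np.array([[80.0, 20.0, 0.0],
                     [0.0, 1.0, 3.0]])
t_st = map_m_step(expected, t_hat_st, C)
```

In this toy case the ratio `C[s] / c_s[s]` is larger for the rare word than for the frequent one, which is the intended effect of choosing gamma below 1.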
Finally, note that this EM procedure reduces to an interpolation method similar to that of Wang et al. by applying Eq. 3 only at the very last M-step, with $\alpha_s$, $m_s$ as above and $C_s = \lambda \sum_t E[c(s, t)]$.
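To see why this choice yields an interpolation, the substitution can be worked out explicitly. In the sketch below, $N_s$ is shorthand introduced here (not in the original) for the total expected count of $s$, the pseudo-count $\alpha_{st} - 1$ is taken to be $C_s$ times the triangulated probability of $t$ given $s$ (written $\hat{t}_{st}$), and the Eq. 3 update is assumed to be the usual normalized sum of expected counts and pseudo-counts.

```latex
% N_s := \sum_{t'} E[c(s,t')]  (shorthand introduced for this derivation)
% Pseudo-count: \alpha_{st} - 1 = C_s \, \hat{t}_{st}(t \mid s), with C_s = \lambda N_s
\begin{align*}
t_{st}(t \mid s)
  &= \frac{E[c(s,t)] + \lambda N_s \, \hat{t}_{st}(t \mid s)}
          {\sum_{t'} \bigl( E[c(s,t')] + \lambda N_s \, \hat{t}_{st}(t' \mid s) \bigr)} \\
  &= \frac{E[c(s,t)] + \lambda N_s \, \hat{t}_{st}(t \mid s)}{(1 + \lambda) \, N_s} \\
  &= \frac{1}{1 + \lambda} \cdot \frac{E[c(s,t)]}{N_s}
     + \frac{\lambda}{1 + \lambda} \cdot \hat{t}_{st}(t \mid s)
\end{align*}
```

The result is a convex combination of the maximum-likelihood estimate $E[c(s,t)] / N_s$ and the triangulated t-table, with weight $1/(1+\lambda)$ on the former.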

Joint Training
Next, we further exploit the triangulation idea in designing a multi-task learning approach that jointly trains the three word alignment models $\Theta_{st}$, $\Theta_{sp}$, and $\Theta_{pt}$.
To do so, we view each model's t-table as originating from Dirichlet distributions defined by the triangulation of the other two t-tables. We then train the models in a MAP-EM-like manner, updating both the model parameters and their prior hyperparameters at each iteration. Roughly, this approach aims at maximizing the posterior likelihood of the three models with respect to both model parameters and their hyperparameters (see Appendix). Procedurally, the idea is simple: in the E-step, expected counts $E[c(\cdot)]$ are collected from each model as usual. In the M-step, each t-table is updated according to Eq. 3 using the current expected counts $E[c(\cdot)]$ and an estimate of $\alpha$ from the triangulation of the most recent version of the other two models. See Algorithm 1.
Algorithm 1: Joint training of $\Theta_{st}$, $\Theta_{sp}$, $\Theta_{pt}$
• For each EM iteration $i$:
  – Estimate the hyperparameters $\alpha$ of each model by triangulating the most recent t-tables of the other two models, using Eq. 7 as required.
  – E-step: collect expected counts $E[c(\cdot)]$ from each model.
  – M-step: update each t-table according to Eq. 3.
Note, however, that we cannot obtain the triangulated t-tables $\hat{t}_{sp}$, $\hat{t}_{pt}$ by simply applying the triangulation equation (Eq. 1). For example, to construct $\hat{t}_{sp}$ we need both source-to-target and target-to-pivot distributions. While we have the former in $t_{st}$, we do not have $t_{tp}$. To resolve this issue, we simply approximate $t_{tp}$ from the reverse t-table $t_{pt} \in \Theta_{pt}$ as follows:
$$t_{tp}(p \mid t) := \frac{c(p) \, t_{pt}(t \mid p)}{\sum_{p'} c(p') \, t_{pt}(t \mid p')} \tag{7}$$
where $c(p)$ denotes the unigram frequency of the word $p$. A similar transformation is done on $t_{sp}$ to obtain $t_{ps}$, which is then used in computing $\hat{t}_{pt}$.
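The reversal step is an application of Bayes' rule with unigram counts as the prior over pivot words. Below is a minimal sketch, assuming dense NumPy t-tables with the layouts noted in the comments; the function name `reverse_ttable` and the toy values are hypothetical.

```python
import numpy as np

def reverse_ttable(t_pt, c_p):
    """Approximate t_tp(p|t) as c(p) * t_pt(t|p), normalized over p.

    Assumed layout: t_pt[p, t] = t_pt(t|p); returns t_tp[t, p] = t_tp(p|t).
    Target words that no pivot word can generate get an all-zero row.
    """
    joint = np.asarray(c_p, dtype=float)[:, None] * t_pt   # shape (P, T)
    col = joint.sum(axis=0, keepdims=True)                 # mass per target word
    reversed_pt = np.divide(joint, col, out=np.zeros_like(joint), where=col > 0)
    return reversed_pt.T

# Toy example: 2 pivot words with unigram counts 30 and 10, 3 target words.
t_pt = np.array([[0.7, 0.3, 0.0],
                 [0.0, 0.5, 0.5]])
c_p = np.array([30.0, 10.0])
t_tp = reverse_ttable(t_pt, c_p)

# Triangulated source-pivot table: go source -> target -> pivot.
t_st = np.array([[1.0, 0.0, 0.0],
                 [0.0, 0.5, 0.5]])
t_hat_sp = t_st @ t_tp
```

The same function applied to $t_{sp}$ with source unigram counts would give the pivot-to-source table needed for the triangulated pivot-target t-table.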

Adjustment of the t-table
Note that a t-table resulting from the triangulation equation (Eq. 1) is both noisy and dense. To see why, consider that $\hat{t}_{st}(t \mid s)$ is non-zero whenever there is a pivot word $p$ that co-occurs with both $s$ and $t$. This is very likely to occur, for example, if $p$ is a function word.
To adjust for both density and noise, we propose a simple product-of-experts re-estimation that relies on the available source-target parallel data. The two experts are the triangulated t-table as defined by Eq. 1 and the exponentiated pointwise mutual information (PMI), derived from simple token co-occurrence statistics of the source-target bitext. That is, we adjust:
$$\hat{t}_{st}(t \mid s) := \hat{t}_{st}(t \mid s) \cdot \frac{p(s, t)}{p(s) \, p(t)} \tag{8}$$
and normalize the result to form valid conditional distributions.
Note that the sparsity pattern of the adjusted t-table matches that of a co-occurrence t-table. We applied this adjustment in all of our experiments.
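The adjustment can be illustrated with a small sketch. It assumes that exponentiated PMI is computed as p(s, t) / (p(s) p(t)) with all three probabilities estimated from a single matrix of source-target co-occurrence counts; how the paper estimates these statistics is not specified here, and the function name `pmi_adjust` and the toy data are hypothetical.

```python
import numpy as np

def pmi_adjust(t_hat_st, cooc):
    """Multiply the triangulated t-table by exp(PMI) and renormalize rows.

    cooc[s, t] holds co-occurrence counts of s and t in the source-target
    bitext (assumed input). Pairs that never co-occur get probability zero.
    """
    cooc = np.asarray(cooc, dtype=float)
    p_joint = cooc / cooc.sum()
    p_s = p_joint.sum(axis=1, keepdims=True)
    p_t = p_joint.sum(axis=0, keepdims=True)
    denom = p_s * p_t
    exp_pmi = np.divide(p_joint, denom, out=np.zeros_like(p_joint), where=denom > 0)
    adjusted = t_hat_st * exp_pmi
    row = adjusted.sum(axis=1, keepdims=True)
    return np.divide(adjusted, row, out=np.zeros_like(adjusted), where=row > 0)

# Toy example: a dense triangulated table and sparse co-occurrence counts.
t_hat_st = np.array([[0.63, 0.32, 0.05],
                     [0.14, 0.46, 0.40]])
cooc = np.array([[8, 2, 0],
                 [0, 3, 7]])
t_adj = pmi_adjust(t_hat_st, cooc)
```

Entries for pairs with zero co-occurrence are zeroed out, so the adjusted table inherits the sparsity pattern of the co-occurrence counts.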

Experimental Results
Pretending that Czech-English is a low-resource pair, we conduct two experiments. In the first, we set French as the pivot language and compare our fixed-prior (§3.1) and joint training (§3.2) approaches against the interpolation method of Wang et al. and a baseline HMM word alignment model (Vogel et al., 1996).
In the second, we examine the effect of the pivot language identity on our joint training approach, varying the pivot language over French, German, Greek, Hungarian, Lithuanian and Slovak.

Data
For word alignment, we use the Czech-English News Commentary corpus, along with a development set of 460 hand-aligned sentence pairs. For the MT experiments, we use the WMT10 tuning set (2051 parallel sentences), and both WMT09/10 shared task test sets. See Table 1.
For each of the 6 pivot languages, we created Czech-pivot and pivot-English bitexts of roughly the same size (ranging from 196k sentences for English-Greek to 223k sentences for Czech-Lithuanian). Each bitext was created by forming a Czech-pivot-English tritext, consisting of about 500k sentences from the Europarl corpus (Koehn, 2005) which was then split into two disjoint Czech-pivot and pivot-English bitexts of equal size. Sentences of length greater than 40 were filtered out from all training corpora.
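The construction of disjoint bitexts from a tritext can be sketched as follows. This is an illustration only: the paper does not say how the split was drawn, so the sketch assumes a simple first-half/second-half split, whitespace tokenization, and a length filter applied to each bitext after splitting. The function name `split_tritext` is hypothetical.

```python
def split_tritext(tritext, max_len=40):
    """Split (cz, pivot, en) triples into disjoint cz-pivot and pivot-en bitexts.

    Assumptions: first half goes to cz-pivot, second half to pivot-en, and a
    sentence pair is dropped if either side has more than max_len tokens.
    """
    half = len(tritext) // 2
    cz_pivot = [(cz, pv) for cz, pv, _ in tritext[:half]]
    pivot_en = [(pv, en) for _, pv, en in tritext[half:]]

    def short_enough(pair):
        return all(len(sentence.split()) <= max_len for sentence in pair)

    return ([p for p in cz_pivot if short_enough(p)],
            [p for p in pivot_en if short_enough(p)])

# Toy tritext with one over-long pivot sentence (41 tokens).
toy = [
    ("a b", "c d", "e f"),
    ("g", " ".join(["w"] * 41), "h"),
    ("i j", "k", "l m"),
    ("n", "o", "p"),
]
cz_pv, pv_en = split_tritext(toy)
```

Since the two bitexts are drawn from different halves of the tritext, no sentence contributes to both, which keeps the source-pivot and pivot-target training data disjoint.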

Experiment 1: Method Comparison
We trained word alignment models in both source-to-target and target-to-source directions. We used 5 iterations of IBM Model 1 followed by 5 iterations of HMM. We tuned hyperparameters to maximize alignment F-score on the hand-aligned development set. Both interpolation parameters $\lambda_{\mathrm{interp}}$ and $\lambda$ were tuned over the range [0, 1]. For our methods, we fixed $\gamma = 0.5$, which we found effective during preliminary experiments. Alignment F-scores using grow-diag-final-and (gdfa) symmetrization (Koehn, 2010) are reported in Table 2, column 2.
We conducted MT experiments using the Moses translation system (Koehn, 2005). We used a 5-gram LM trained on the Xinhua portion of English Gigaword (LDC2007T07). To tune the decoder, we used the WMT10 tune set. MT Bleu scores are reported in Table 2, columns 3-4.
Both our methods outperform the baseline and the interpolation approach. In particular, the joint training approach more than doubles the gains obtained by the interpolation approach, on both F- and Bleu.
We also evaluated the Czech-French and French-English alignments produced as a by-product of our joint method. While our French-to-English MT experiments showed no improvement in Bleu, we saw a +0.6 (25.6 to 26.2) gain in Bleu on the Czech-to-French translation task. This shows that joint training may lead to some improvements even on high-resource bitexts.

Other Pivot Languages
We examined how the choice of pivot language affects the joint training approach by varying it over 6 languages (French, German, Greek, Hungarian, Lithuanian and Slovak), while keeping the size of the pivot language resources roughly the same.

Table 1: Czech-English data statistics.

             train    dev     WMT09   WMT10
sentences    85k      460     2525    2489
cz tokens    1.63M    9.7k    55k     53k
en tokens    1.78M    10k     66k     62k

Somewhat surprisingly, all models achieved an F-score of about 70%, which resulted in Bleu scores comparable to those reported with French (Table 2). Subsequently, we combined all pivot languages by simply concatenating the aligned parallel texts across pairs, triples and all pivot languages. Combining all pivots yielded modest Bleu score improvements of +0.2 and +0.1 on the test datasets (Table 3).
Considering the low variance in F- and Bleu scores across pivot languages, we computed the pairwise F-scores between the predicted alignments: all scores ranged around 97-98%, indicating that the choice of pivot language had little effect on the joint training procedure.
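A pairwise F-score between two predicted alignments can be computed by treating one as the hypothesis and the other as the reference. The sketch below assumes each alignment is a collection of `(sentence_id, source_index, target_index)` links and that all links count equally (no sure/possible distinction); the function name and the toy links are hypothetical.

```python
def alignment_f_score(links_a, links_b):
    """Balanced F-score between two sets of alignment links.

    Each link is assumed to be a (sentence_id, source_index, target_index)
    tuple. The score is symmetric in its two arguments.
    """
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 1.0
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(a)
    recall = overlap / len(b)
    return 2.0 * precision * recall / (precision + recall)

# Toy alignments from two hypothetical pivot languages; they differ in one link.
via_french = {(0, 0, 0), (0, 1, 1), (0, 2, 2), (1, 0, 1)}
via_german = {(0, 0, 0), (0, 1, 1), (0, 2, 3), (1, 0, 1)}
score = alignment_f_score(via_french, via_german)
```

A score near 1 for every pair of pivot languages means the predicted alignments are nearly identical, which is the sense in which the choice of pivot has little effect.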
To further verify, we repeated this experiment over Greek-English and Lithuanian-English as the source-target task (85k parallel sentences), using the same pivot languages as above, and with comparable amounts of parallel data (∼200k sentences). We obtained similar results: In all cases, pairwise F-scores were above 97%.

Related Work
The term "triangulation" comes from the phrase-table triangulation literature (Cohn and Lapata, 2007; Razmara and Sarkar, 2013; Dholakia and Sarkar, 2014), in which source-pivot and pivot-target phrase tables are triangulated according to Eq. 1 (with words replaced by phrases). The resulting triangulated phrase table can then be combined with an existing source-target phrase table, and is especially useful in increasing the source language vocabulary coverage, reducing OOVs. In our case, since word alignment is a closed vocabulary task, OOVs are never an issue.
In word alignment, Kumar et al. (2007) use multilingual parallel data to compute better source-target alignment posteriors. Filali and Bilmes (2005) tag each source token and target token with their most likely translation in a pivot language, and then proceed to align (source word, source tag) tuple sequences to (target word, target tag) tuple sequences. In contrast, our word alignment method can be applied without multilingual parallel data, and does not commit to hard decisions.

Conclusion and Future Work
We presented a simple multi-task learning algorithm that jointly trains three word alignment models over disjoint bitexts. Our approach is a natural extension of a mathematically sound MAP-EM algorithm we originally developed to better utilize the model triangulation idea. Both algorithms are easy to implement (with closed-form solutions for each step) and require minimal effort to integrate into an EM-based word alignment system. We evaluated our methods on a low-resource Czech-English word alignment task using additional Czech-French and French-English corpora. Our multi-task learning approach significantly improves F- and Bleu scores compared to both baseline and the interpolation method of Wang et al. (2006). Further experiments showed our approach is insensitive to the choice of pivot language, producing roughly the same alignments over six different pivot language choices.
For future work, we plan to improve word alignment and translation quality in a more data-restricted case where there are very weak source-pivot resources: for example, word alignment of Malagasy-English via French, using only a Malagasy-French dictionary, or Pashto-English via Persian.

Appendix
Intuitively, $\alpha_{spt}$ represents the number of times a source-pivot-target triplet $(s, p, t)$ was observed.
With this prior, we can maximize the posterior likelihood of the three models given the three bitexts (denoted $\mathrm{data} = \{\mathrm{bitext}_{st}, \mathrm{bitext}_{sp}, \mathrm{bitext}_{pt}\}$) with respect to all parameters and hyperparameters:
$$\arg\max_{\Theta, \alpha} P(\Theta \mid \alpha, \mathrm{data}) = \arg\max_{\Theta, \alpha} \prod_{d \in \{st, sp, pt\}} P(\mathrm{bitext}_d \mid \Theta_d) \, P(\Theta_d \mid \alpha)$$
Under the generative story, we need only observe the marginals $\alpha_{s \cdot t}$, $\alpha_{sp \cdot}$, $\alpha_{\cdot pt}$ of $\alpha$. Therefore, instead of explicitly optimizing over $\alpha$, we can optimize over the marginals while keeping them consistent (via constraints such as $\sum_t \alpha_{s \cdot t} = \sum_p \alpha_{sp \cdot}$ for all $s$).
In our joint training algorithm (Algorithm 1) we abandon these consistency constraints in favor of closed-form estimates of the marginals $\alpha_{s \cdot t}$, $\alpha_{sp \cdot}$, $\alpha_{\cdot pt}$.