Improving Pivot Translation by Remembering the Pivot

Pivot translation allows for translation of language pairs with little or no parallel data by introducing a third language for which data exists. In particular, the triangulation method, which translates by combining source-pivot and pivot-target translation models into a source-target model, is known for its high translation accuracy. However, in the conventional triangulation method, information about pivot phrases is forgotten and not used in the translation process. In this paper, we propose a novel approach that remembers the pivot phrases in the triangulation stage and uses a pivot language model as an additional information source at translation time. Experimental results on the Europarl corpus showed gains of 0.4-1.2 BLEU points in all tested combinations of languages.


Introduction
In statistical machine translation (SMT) (Brown et al., 1993), it is known that translation with models trained on larger parallel corpora can achieve greater accuracy (Dyer et al., 2008). Unfortunately, large bilingual corpora are not readily available for many language pairs, particularly those that don't include English. One effective solution to overcome the scarceness of bilingual data is to introduce a pivot language for which parallel data with the source and target languages exists (de Gispert and Mariño, 2006).
Among various methods using pivot languages, the triangulation method (Cohn and Lapata, 2007; Utiyama and Isahara, 2007; Zhu et al., 2014), which translates by combining source-pivot and pivot-target translation models into a source-target model, has been shown to be one of the most effective approaches. However, word sense ambiguity and interlingual differences in word usage make it difficult to accurately learn correspondences between source and target phrases. Figure 1 (a) shows an example of three words in German and Italian that each correspond to the English polysemous word "approach." In such a case, finding associated source-target phrase pairs and estimating translation probabilities properly becomes a complicated problem. Furthermore, in the conventional triangulation method, information about the pivot phrases that act as bridges between source and target phrases is lost after learning phrase pairs, as shown in Figure 1 (b).
To overcome these problems, we propose a novel triangulation method that remembers the pivot phrase connecting source and target in the records of the phrase/rule table, and estimates a joint translation probability from the source to the target and pivot simultaneously. We show an example in Figure 1 (c). The advantage of this approach is that rich monolingual resources are generally available for pivot languages such as English, and SMT can utilize this additional information to improve translation quality.
To utilize information about the pivot language at translation time, we train a Multi-Synchronous Context-free Grammar (MSCFG) (Neubig et al., 2015), a generalized extension of synchronous CFGs (SCFGs) (Chiang, 2007), that can generate strings in multiple languages at the same time.
To create the MSCFG, we triangulate source-pivot and pivot-target SCFG rule tables not into a single source-target SCFG, but into a source-target-pivot MSCFG rule table that remembers the pivot. During decoding, we use language models over both the target and the pivot to assess the naturalness of the derivation. We perform experiments on pivot translation of Europarl proceedings, which show that our method indeed provides significant gains in accuracy (of up to 1.2 BLEU points) in all combinations of 4 languages with English as a pivot language.

Synchronous Context-free Grammars
First, we cover SCFGs, which are widely used in machine translation, particularly hierarchical phrase-based translation (Hiero; Chiang (2007)).
In SCFGs, the elementary structures are rewrite rules with aligned pairs of right-hand sides:
$$X \rightarrow \langle s, t \rangle \quad (1)$$
where X is the head of the rewrite rule, and s and t are strings of terminals and non-terminals on the source and target sides, respectively. Each string in the right-hand-side tuple has the same number of indexed non-terminals, and identically indexed non-terminals correspond to each other. For example, a synchronous rule could take the form of:
In the SCFG training method proposed by Chiang (2007), SCFG rules are extracted based on parallel sentences and automatically obtained word alignments. Each extracted rule is scored with phrase translation probabilities in both directions φ(s|t) and φ(t|s), lexical translation probabilities in both directions φ_lex(s|t) and φ_lex(t|s), a word penalty counting the terminals in t, and a constant phrase penalty of 1.
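To make the rule format concrete, the following is an illustrative synchronous rule with two co-indexed non-terminals. It is a hypothetical French-English example assumed here for exposition only and is not taken from the experiments; it shows how identically indexed non-terminals can be reordered between the two sides.

```latex
X \rightarrow \langle\, X_1 \text{ de } X_2,\;\; X_2 \text{ 's } X_1 \,\rangle
```

Here X_1 and X_2 are each rewritten by the same rule on both sides, so whatever replaces X_1 in the source string is translated in the position of X_1 in the target string.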
At translation time, the decoder searches for the target sentence that maximizes the derivation probability, which is defined as the sum of the scores of the rules used in the derivation, and the log of the language model probability over the target strings. When not considering an LM, it is possible to efficiently find the best translation for an input sentence using the CKY+ algorithm (Chappelier et al., 1998). When using an LM, the expanded search space is further reduced based on a limit on expanded edges, or total states per span, through a procedure such as cube pruning (Chiang, 2007).
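The scoring described above can be sketched as follows. This is a minimal illustration, assuming a log-linear model in which each rule's score is a weighted sum of its feature values (log probabilities and penalties); the function names, feature names, and weights are hypothetical and do not correspond to any decoder's API.

```python
import math


def rule_score(features, weights):
    """Weighted sum of a rule's feature values.

    `features` holds values already in the form the model consumes
    (log probabilities, penalty counts); names are illustrative.
    """
    return sum(weights[name] * value for name, value in features.items())


def derivation_score(rules, weights, lm_logprob, lm_weight=1.0):
    """Sum of the rule scores plus the weighted LM log-probability."""
    return sum(rule_score(f, weights) for f in rules) + lm_weight * lm_logprob


# Toy derivation with two identical rules; all numbers are made up.
toy_rule = {"log_phi_t_given_s": math.log(0.5), "phrase_penalty": 1.0}
toy_weights = {"log_phi_t_given_s": 1.0, "phrase_penalty": -0.5}
toy_total = derivation_score([toy_rule, toy_rule], toy_weights,
                             lm_logprob=-2.0, lm_weight=0.5)
```

The decoder's search then amounts to finding the derivation with the highest such score among those that yield the input sentence on the source side.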

Multi-Synchronous CFGs
MSCFGs (Neubig et al., 2015) are a generalization of SCFGs that are able to generate sentences in multiple target languages simultaneously. The single target-side string t in the SCFG production rule is extended to have strings for N target languages:
$$X \rightarrow \langle s, t_1, \ldots, t_N \rangle \quad (3)$$
Performing multi-target translation with MSCFGs is quite similar to translating using standard SCFGs, with the exception of the expanded state space caused by having one LM for each target. Neubig et al. (2015) propose a sequential search method that ensures diversity in the primary target search space by first expanding with only the primary target LM, then additionally expanding the states for the other LMs, a strategy we also adopt in this work.
In the standard training method for MSCFGs, the multi-target rewrite rules are extracted from multilingual line-aligned corpora by applying an extended version of the standard SCFG rule extraction method, and scored with features that consider the multiple targets. It should be noted that this training method requires a large amount of line-aligned training data including the source and all target languages. This assumption breaks down when we have little parallel data, and therefore we propose a method to generate MSCFG rules by triangulating two SCFG rule tables in the following section.

Pivot Translation Methods
Several methods have been proposed for SMT using pivot languages. These include cascade methods that consecutively translate from source to pivot then pivot to target (de Gispert and Mariño, 2006), synthetic data methods that machine-translate the training data to generate a pseudo-parallel corpus (de Gispert and Mariño, 2006), and triangulation methods that obtain a source-target phrase/rule table by merging source-pivot and pivot-target table entries with identical pivot language phrases (Cohn and Lapata, 2007). In particular, the triangulation method is notable for producing higher quality translation results than other pivot methods (Utiyama and Isahara, 2007), so we use it as a base for our work.

Traditional Triangulation Method
In the triangulation method by Cohn and Lapata (2007), we first train source-pivot and pivot-target rule tables, then create rules:
$$X \rightarrow \langle s, t \rangle \quad (4)$$
if there exists a pivot phrase p such that the pair ⟨s, p⟩ is in the source-pivot table T_SP and the pair ⟨p, t⟩ is in the pivot-target table T_PT. The source-target table T_ST is created by calculating translation probabilities from the phrase translation probabilities φ(·) and lexical translation probabilities φ_lex(·) of all connected phrases according to the following equations (Cohn and Lapata, 2007):
$$\phi(t|s) = \sum_{p \in T_{SP} \cap T_{PT}} \phi(t|p)\,\phi(p|s) \quad (5)$$
$$\phi(s|t) = \sum_{p \in T_{SP} \cap T_{PT}} \phi(s|p)\,\phi(p|t) \quad (6)$$
$$\phi_{lex}(t|s) = \sum_{p \in T_{SP} \cap T_{PT}} \phi_{lex}(t|p)\,\phi_{lex}(p|s) \quad (7)$$
$$\phi_{lex}(s|t) = \sum_{p \in T_{SP} \cap T_{PT}} \phi_{lex}(s|p)\,\phi_{lex}(p|t) \quad (8)$$
Equations (5)-(8) are based on the memoryless channel model, which assumes φ(t|p, s) = φ(t|p) and φ(s|p, t) = φ(s|p). Unfortunately, these equations are not accurate due to polysemy and mismatches between the grammars of the languages. As a result, pivot translation is significantly more ambiguous than standard translation.
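The merging step of traditional triangulation can be sketched as a join on shared pivot phrases followed by a sum over those pivots. The sketch below is a simplified illustration, not the implementation used in the experiments: it computes only the forward phrase probability φ(t|s), represents tables as plain dictionaries, and uses a function name and toy probabilities that are assumptions made for this example.

```python
from collections import defaultdict


def triangulate(sp_table, pt_table):
    """Merge source-pivot and pivot-target tables by summing over pivots.

    sp_table maps (s, p) -> phi(p|s); pt_table maps (p, t) -> phi(t|p).
    Returns a dict mapping (s, t) -> phi(t|s).
    """
    # Index pivot-target entries by pivot phrase so the join is cheap.
    by_pivot = defaultdict(list)
    for (p, t), t_given_p in pt_table.items():
        by_pivot[p].append((t, t_given_p))

    st_table = defaultdict(float)
    for (s, p), p_given_s in sp_table.items():
        for t, t_given_p in by_pivot.get(p, []):
            # Memoryless channel assumption: phi(t|p, s) = phi(t|p).
            st_table[(s, t)] += t_given_p * p_given_s
    return dict(st_table)


# Toy tables with invented probabilities (German -> English -> Italian).
toy_sp = {("annäherung", "approach"): 0.6,
          ("annäherung", "rapprochement"): 0.4}
toy_pt = {("approach", "approccio"): 0.5,
          ("approach", "avvicinamento"): 0.5,
          ("rapprochement", "avvicinamento"): 1.0}
toy_st = triangulate(toy_sp, toy_pt)
```

Note that the returned table no longer records which pivot phrase linked each source-target pair, which is exactly the information the conventional method discards.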

Proposed Triangulation Method
To help reduce this ambiguity, our proposed triangulation method remembers the corresponding pivot phrase as additional information to be utilized for disambiguation. Specifically, instead of marginalizing over the pivot phrase p, we create an MSCFG rule for the tuple of the connected source-target-pivot phrases such as:
$$X \rightarrow \langle s, t, p \rangle \quad (9)$$
The advantage of translation with these rules is that they allow for incorporation of additional features over the pivot sentence such as a strong pivot LM.
In addition to equations (5)-(8), we also estimate translation probabilities φ(t, p|s) and φ(s|p, t) that consider both target and pivot phrases at the same time according to:
$$\phi(t, p|s) = \phi(t|p)\,\phi(p|s) \quad (10)$$
$$\phi(s|p, t) = \phi(s|p) \quad (11)$$
Translation probabilities between source and pivot phrases φ(p|s), φ(s|p), φ_lex(p|s), φ_lex(s|p) can also be used directly from the source-pivot rule table. This results in 13 features for each MSCFG rule: 10 translation probabilities, 2 word penalties counting the terminals in t and p, and a constant phrase penalty of 1.
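A minimal sketch of the pivot-remembering variant follows. It assumes that, under the memoryless channel model, the joint probability factorizes as φ(t, p|s) = φ(t|p)φ(p|s) and that φ(s|p, t) reduces to φ(s|p). The table layout, feature names, and toy numbers are hypothetical, and only a subset of the 13 features is computed.

```python
def triangulate_with_pivot(sp_table, pt_table):
    """Build rules keyed by (s, t, p) instead of summing over p.

    sp_table maps (s, p) -> {"p|s": ..., "s|p": ...}
    pt_table maps (p, t) -> {"t|p": ...}
    """
    rules = {}
    # A nested loop keeps the sketch short; a real table would be indexed
    # by pivot phrase to avoid the quadratic scan.
    for (s, p), sp in sp_table.items():
        for (p2, t), pt in pt_table.items():
            if p2 != p:
                continue
            rules[(s, t, p)] = {
                "t,p|s": pt["t|p"] * sp["p|s"],  # joint target-pivot probability
                "s|p,t": sp["s|p"],              # reduces to phi(s|p)
                "p|s": sp["p|s"],                # copied from source-pivot table
                "s|p": sp["s|p"],
            }
    return rules


# Toy tables with invented probabilities.
toy_sp_feats = {("annäherung", "approach"): {"p|s": 0.6, "s|p": 0.2}}
toy_pt_feats = {("approach", "approccio"): {"t|p": 0.5},
                ("method", "metodo"): {"t|p": 0.9}}
toy_rules = triangulate_with_pivot(toy_sp_feats, toy_pt_feats)
```

Because each rule keeps its pivot string, a decoder can score the pivot side of a derivation with a pivot LM, which is not possible once the pivot has been summed out.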
It should be noted that remembering the pivot results in significantly larger rule tables. To save computational resources, several pruning methods are conceivable. Neubig et al. (2015) show that an effective pruning method in the case of a main target T_1 with the help of target T_2 is the T_1-pruning method, namely, using L candidates of t_1 with the highest translation probability φ(t_1|s) and selecting t_2 with the highest φ(t_1, t_2|s) for each t_1. We follow this approach, using the L-best t and the corresponding 1-best p.
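The pruning strategy above can be sketched as follows: for each source phrase, keep the L targets t with the highest φ(t|s), and for each kept t retain only the pivot p with the highest φ(t, p|s). The data layout and feature names ("phi_t_s", "phi_tp_s") are assumptions made for this illustration, and tie-breaking is left unspecified.

```python
from collections import defaultdict


def t1_prune(rules, limit):
    """Keep the `limit` best targets per source, each with its 1-best pivot.

    rules maps (s, t, p) -> {"phi_t_s": phi(t|s), "phi_tp_s": phi(t,p|s)}.
    """
    # Step 1: for every (s, t), remember only the best-scoring pivot.
    best_pivot = defaultdict(dict)
    for (s, t, p), feats in rules.items():
        current = best_pivot[s].get(t)
        if current is None or feats["phi_tp_s"] > current[1]["phi_tp_s"]:
            best_pivot[s][t] = (p, feats)

    # Step 2: rank targets by phi(t|s) and keep the top `limit` per source.
    pruned = {}
    for s, targets in best_pivot.items():
        ranked = sorted(targets.items(),
                        key=lambda item: item[1][1]["phi_t_s"],
                        reverse=True)
        for t, (p, feats) in ranked[:limit]:
            pruned[(s, t, p)] = feats
    return pruned


# Toy rule table with invented probabilities.
toy_table = {
    ("s", "t1", "p1"): {"phi_t_s": 0.7, "phi_tp_s": 0.3},
    ("s", "t1", "p2"): {"phi_t_s": 0.7, "phi_tp_s": 0.4},
    ("s", "t2", "p1"): {"phi_t_s": 0.2, "phi_tp_s": 0.2},
    ("s", "t3", "p3"): {"phi_t_s": 0.1, "phi_tp_s": 0.1},
}
toy_pruned = t1_prune(toy_table, limit=2)
```

With a limit of 2, the sketch keeps t1 (paired with its higher-scoring pivot p2) and t2, and drops t3.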

Experimental Setup
We evaluate the proposed triangulation method through pivot translation experiments on the Europarl corpus, a multilingual corpus covering 21 European languages (Koehn, 2005) that is widely used in pivot translation work. In our work, we perform translation among German (de), Spanish (es), French (fr) and Italian (it), with English (en) as the pivot language. To prepare the data for these 5 languages, we first use the Gale-Church alignment algorithm (Gale and Church, 1993) to retrieve a multilingual line-aligned corpus of about 900k sentences, then hold out 1,500 sentences each for tuning and test. In our basic training setup, we use 100k sentences for training both the TMs and the target LMs. We assume that in many situations a large amount of English monolingual data is readily available, and therefore we train pivot LMs with different data sizes up to 2M sentences. As a decoder, we use Travatar (Neubig, 2013), and train SCFG TMs with its Hiero extraction code. Translation results are evaluated by BLEU (Papineni et al., 2002), and we tune to maximize BLEU scores using MERT (Och, 2003). For trained and triangulated TMs, we use T_1 rule pruning with a limit of 20 rules per source rule. For decoding using MSCFG, we adopt the sequential search method.
We evaluate 6 translation methods:
Direct: Translating with a direct SCFG trained on the source-target parallel corpus (not using a pivot language) for comparison.
Cascade: Cascading source-pivot and pivot-target translation systems.
Tri. SCFG: Triangulating source-pivot and pivot-target SCFG TMs into a source-target SCFG TM using the traditional method.

Experimental Results
The results of experiments on all combinations of pivot translation tasks for 4 languages via English are shown in Table 1. From the results, we can see that the proposed triangulation method considering pivot LMs outperforms the traditional triangulation method for all language pairs, and that translation with larger pivot LMs improves the BLEU scores. For all languages, the pivot-remembering triangulation method with the pivot LM trained on 2M sentences achieves the highest score among the pivot translation methods, with gains of 0.4-1.2 BLEU points over the baseline method. This shows that remembering the pivot and using it to disambiguate results is consistently effective in improving translation accuracy. We can also see that the MSCFG triangulated model without the pivot LM slightly outperforms the standard SCFG triangulation method for the majority of language pairs. It is conceivable that the additional translation probability scores involving pivot phrases are effective features that allow for more accurate rule selection.
Finally, we show an example of a translated sentence for which pivot-side ambiguity is resolved in the proposed triangulation method:
Input (German): ich bedaure , daß es keine gemeinsame annäherung gegeben hat .
The derivation uses an MSCFG rule connecting "approccio" to "approach" in the pivot, and we can consider that appropriate selection of English words according to the context contributes to selecting relevant vocabulary in Italian.

Conclusion
In this paper, we have proposed a method for pivot translation using triangulation of SCFG rule tables into an MSCFG rule table that remembers the pivot, and performing translation with pivot LMs.
In experiments, we found that these models are effective when a strong pivot LM exists. In the future, we plan to explore more refined methods for devising effective intermediate expressions, and to improve the estimation of probabilities for triangulated rules.