Tree as a Pivot: Syntactic Matching Methods in Pivot Translation

Pivot translation is a useful method for translating between languages with little or no parallel data by utilizing parallel data in an intermediate language such as English. A popular approach for pivot translation used in phrase-based or tree-based translation models combines source-pivot and pivot-target translation models into a source-target model, which is known as triangulation. However, this combination is based on the constituent words' surface forms and often produces incorrect source-target phrase pairs due to semantic ambiguity in the pivot language and interlingual differences. This degrades translation accuracy. In this paper, we propose an approach to triangulation that uses syntactic subtrees in the pivot language to distinguish pivot language words by their syntactic roles and thereby avoid incorrect phrase combinations. Experimental results on the United Nations Parallel Corpus show that the proposed method yields gains in all tested language combinations, of up to 2.3 BLEU points.


Introduction
In statistical machine translation (SMT) (Brown et al., 1993), it is known that translation with models trained on larger parallel corpora can achieve greater accuracy (Dyer et al., 2008). Unfortunately, large bilingual corpora are not readily available for many language pairs, particularly those that do not include English. One effective solution to overcome the scarcity of bilingual data is to introduce a pivot language for which parallel data with the source and target languages exists (de Gispert and Mariño, 2006). Among various methods using pivot languages, one popular and effective method is the triangulation method (Utiyama and Isahara, 2007; Cohn and Lapata, 2007), which first combines source-pivot and pivot-target translation models (TMs) into a source-target model, then translates using this combined model. The procedure of triangulating two TMs into one has been examined for different frameworks of SMT, and its effectiveness has been confirmed both in Phrase-Based SMT (PBMT) (Koehn et al., 2003; Utiyama and Isahara, 2007) and in Hierarchical Phrase-Based SMT (Hiero) (Chiang, 2007; Miura et al., 2015). However, word sense ambiguity and interlingual differences in word usage make it difficult to accurately learn correspondences between source and target phrases, and thus the accuracy obtained by triangulated models lags behind that of models trained on direct parallel corpora.
In the triangulation method, source-pivot and pivot-target phrase pairs are connected as a source-target phrase pair when a common pivot-side phrase exists. In Figure 1 (a), we show an example of standard triangulation on Hiero TMs that combines hierarchical rules of phrase pairs by matching pivot phrases with equivalent surface forms. This example also demonstrates problems of ambiguity: the English word "record" can correspond to several different parts-of-speech according to the context. More broadly, phrases including this word also have different possible grammatical structures, but it is impossible to uniquely identify this structure unless information about the surrounding context is given. This varying syntactic structure will affect translation. For example, the French verb "enregistrer" corresponds to the English verb "record", but the French noun "dossier" also corresponds to "record", this time as a noun. As a more extreme example, Chinese is a language that does not have inflections according to the part-of-speech of the word. As a result, even in contexts where "record" is used with different parts-of-speech, the Chinese word "记录" will be used, although the word order will change. These facts might result in an incorrect connection of "[X1] enregistrer [X2]" and "[X2] [X1] 记录", even though the proper correspondences of "[X1] enregistrer [X2]" and "[X1] dossier [X2]" would be "[X1] 记录 [X2]" and "[X2] [X1] 记录" respectively. Hence a superficial phrase matching method based solely on the surface form of the pivot will often combine incorrect phrase pairs, causing translation errors if their translation scores are estimated to be higher than those of the proper correspondences.
Given this background, we hypothesize that disambiguation of these cases would be easier if the necessary syntactic information, such as phrase structure, is considered during pivoting. To incorporate this intuition into our models, we propose a method that considers syntactic information of the pivot phrase, as shown in Figure 1 (b). In this way, the model will distinguish translation rules extracted in contexts in which the English symbol string "[X1] record [X2]" behaves as a verbal phrase from those extracted in contexts in which the same string acts as a nominal phrase.
Specifically, we propose a method based on Synchronous Context-Free Grammars (SCFGs) (Aho and Ullman, 1969; Chiang, 2007), which are widely used in tree-based machine translation frameworks (§2). After describing the baseline triangulation method (§3), which uses only the surface forms for performing triangulation, we propose two methods for triangulation based on syntactic matching (§4). The first places a hard restriction on exact matching of parse trees (§4.1) included in translation rules, while the second places a softer restriction allowing partial matches (§4.2). To investigate the effect of our proposed method on pivot translation quality, we perform pivot translation experiments on the United Nations Parallel Corpus (Ziemski et al., 2016), which show that our method indeed provides significant gains in accuracy (of up to 2.3 BLEU points) in almost all combinations of 5 languages with English as a pivot language (§5). In addition, as an auxiliary result, we compare pivot translation using the proposed method with zero-shot neural machine translation, and find that triangulation of symbolic translation models still significantly outperforms neural MT in the zero-resource scenario.

Synchronous Context-Free Grammars
In this section, we first cover SCFGs, which are widely used in machine translation, particularly hierarchical phrase-based translation (Hiero) (Chiang, 2007). In SCFGs, the elementary structures used in translation are synchronous rewrite rules with aligned pairs of source and target symbols on the right-hand side, of the form X → ⟨s, t⟩, where X is the head symbol of the rewrite rule, and s and t are strings of terminals and non-terminals on the source and target side respectively. Each string in the right-hand side pair has the same number of indexed non-terminals, and identically indexed non-terminals correspond to each other. Synchronous rules can be extracted based on parallel sentences and automatically obtained word alignments. Each extracted rule is scored with phrase translation probabilities in both directions, ϕ(s|t) and ϕ(t|s), lexical translation probabilities in both directions, ϕ_lex(s|t) and ϕ_lex(t|s), a word penalty counting the terminals in t, and a constant phrase penalty of 1.
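As a minimal sketch of the rule structure described above, the following Python snippet represents a synchronous rule and checks that both sides carry the same indexed non-terminals. The `SyncRule` class, the `substitute` helper, and the example rule are hypothetical names and data introduced only for illustration; they are not taken from the paper or from any toolkit.

```python
import re
from dataclasses import dataclass

# Indexed non-terminals are written "[X1]", "[X2]", ... as in the paper's examples.
NONTERMINAL = re.compile(r"^\[X\d+\]$")


@dataclass(frozen=True)
class SyncRule:
    """A synchronous rule X -> <s, t> (hypothetical container, for illustration)."""
    head: str
    source: tuple
    target: tuple

    def is_well_formed(self):
        # Both sides must carry the same indexed non-terminals,
        # and each index may appear only once per side.
        src = sorted(sym for sym in self.source if NONTERMINAL.match(sym))
        tgt = sorted(sym for sym in self.target if NONTERMINAL.match(sym))
        return src == tgt and len(set(src)) == len(src)


def substitute(side, fillers):
    """Replace each indexed non-terminal on one side with its filler tokens."""
    out = []
    for sym in side:
        if NONTERMINAL.match(sym):
            out.extend(fillers[sym])
        else:
            out.append(sym)
    return out


# Illustrative rule (an assumption, not from the paper):
# X -> <[X1] of [X2], [X2] 的 [X1]>
rule = SyncRule("X", ("[X1]", "of", "[X2]"), ("[X2]", "的", "[X1]"))
source_string = substitute(rule.source, {"[X1]": ["members"], "[X2]": ["the", "committee"]})
target_string = substitute(rule.target, {"[X1]": ["成员"], "[X2]": ["委员会"]})
```

Because identically indexed non-terminals are linked, the reordering between the two sides is encoded in the rule itself rather than in a separate reordering model.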
At translation time, the decoder searches for the target sentence that maximizes the derivation probability, which is defined as the sum of the scores of the rules used in the derivation and the log of the language model (LM) probability over the target strings. When not considering an LM, it is possible to efficiently find the best translation for an input sentence using the CKY+ algorithm (Chappelier et al., 1998). When using an LM, the expanded search space is further reduced based on a limit on expanded edges, or total states per span, through a procedure such as cube pruning (Chiang, 2007).

Hierarchical Rules
In this section, we specifically cover the rules used in Hiero. Hierarchical rules are composed of the initial head symbol S and synchronous rules containing terminals and a single kind of non-terminal X. Hierarchical rules are extracted using the same phrase extraction procedure used in phrase-based translation (Koehn et al., 2003) based on word alignments, followed by a step that performs recursive extraction of hierarchical phrases (Chiang, 2007).
Given such hierarchical rules, an input sentence is translated by a derivation that applies them. The advantage of Hiero is that it is able to achieve relatively high word reordering accuracy (compared to other symbolic SMT alternatives such as standard phrase-based MT) without language-dependent processing. On the other hand, since it does not use syntactic information and tries to extract all possible combinations of rules, it tends to extract very large translation rule tables and also tends to be less syntactically faithful in its derivations.

Explicitly Syntactic Rules
An alternative to Hiero rules is the use of synchronous context-free grammar or synchronous tree-substitution grammar (Graehl and Knight, 2004) rules that explicitly take into account the syntax of the source side (tree-to-string rules), target side (string-to-tree rules), or both (tree-to-tree rules). Taking the example of tree-to-string (T2S) rules, these use parse trees on the source language side, and the head symbols of the synchronous rules are not limited to S or X, but instead use non-terminal symbols corresponding to the phrase structure tags of a given parse tree. In T2S rules, the parse subtrees on the source language side are given in the form of S-expressions.
From these rules, we can translate from the parse tree of the input sentence by derivation, producing a target string such as "委员会 的 主席团 成員". In this way, it is possible in T2S translation to obtain a result conforming to the source language's grammar. This method also has the advantage that it reduces the number of less-useful synchronous rules extracted by syntax-agnostic methods such as Hiero, making it possible to learn more compact rule tables and allowing for faster translation.

Standard Triangulation Method
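In the triangulation method by Cohn and Lapata (2007), we first train source-pivot and pivot-target rule tables as T_SP and T_PT respectively. Then we search T_SP and T_PT for source-pivot and pivot-target rules having a common pivot phrase, and synthesize them into source-target rules to create the rule table T_ST:

\[
X \to \langle s, t \rangle \in T_{ST} \quad \text{s.t.} \quad X \to \langle s, p \rangle \in T_{SP} \,\wedge\, X \to \langle p, t \rangle \in T_{PT}
\]

For all the combined source-target rules, the phrase translation probability ϕ(·) and lexical translation probability ϕ_lex(·) are estimated according to the following equations:

\begin{align}
\phi(t|s) &= \sum_{p \in T_{SP} \cap T_{PT}} \phi(t|p)\,\phi(p|s) \tag{11} \\
\phi(s|t) &= \sum_{p \in T_{SP} \cap T_{PT}} \phi(s|p)\,\phi(p|t) \tag{12} \\
\phi_{lex}(t|s) &= \sum_{p \in T_{SP} \cap T_{PT}} \phi_{lex}(t|p)\,\phi_{lex}(p|s) \tag{13} \\
\phi_{lex}(s|t) &= \sum_{p \in T_{SP} \cap T_{PT}} \phi_{lex}(s|p)\,\phi_{lex}(p|t) \tag{14}
\end{align}

Equations (11)-(14) are based on the memoryless channel model, which assumes:

\begin{align}
\phi(t|p,s) &= \phi(t|p) \tag{15} \\
\phi(s|p,t) &= \phi(s|p) \tag{16}
\end{align}

For example, in equation (15), it is assumed that the translation probability of the target phrase given the pivot and source phrases is never affected by the source phrase. However, it is easy to come up with examples where this assumption does not hold. Specifically, if there are multiple interpretations of the pivot phrase, as shown in the example of Figure 1, source and target phrases that do not correspond to each other semantically might be connected, and over-estimation by summing products of the translation probabilities is likely to cause failed translations.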
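The marginalization over shared pivot phrases can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: rule tables are plain dictionaries holding a single probability per rule, the function name `triangulate` is hypothetical, and the toy probabilities are invented to reproduce the "record" ambiguity from the introduction rather than taken from any trained model.

```python
from collections import defaultdict


def triangulate(src_pvt, pvt_tgt):
    """Combine phi(p|s) and phi(t|p) into phi(t|s) by summing over shared pivots.

    src_pvt maps (source, pivot) -> phi(pivot | source)
    pvt_tgt maps (pivot, target) -> phi(target | pivot)
    """
    # Index pivot-target rules by their pivot side for fast lookup.
    by_pivot = defaultdict(list)
    for (pivot, target), t_given_p in pvt_tgt.items():
        by_pivot[pivot].append((target, t_given_p))

    src_tgt = defaultdict(float)
    for (source, pivot), p_given_s in src_pvt.items():
        for target, t_given_p in by_pivot.get(pivot, []):
            # Memoryless assumption: phi(t | p, s) is replaced by phi(t | p).
            src_tgt[(source, target)] += t_given_p * p_given_s
    return dict(src_tgt)


# Toy tables (invented numbers): the verb and noun readings of "record"
# share one surface form on the pivot side.
src_pvt = {
    ("[X1] enregistrer [X2]", "[X1] record [X2]"): 1.0,
    ("[X1] dossier [X2]", "[X1] record [X2]"): 1.0,
}
pvt_tgt = {
    ("[X1] record [X2]", "[X1] 记录 [X2]"): 0.5,
    ("[X1] record [X2]", "[X2] [X1] 记录"): 0.5,
}
table = triangulate(src_pvt, pvt_tgt)
```

In this toy setting all four source-target pairs receive the same score, including the two incorrect connections, which is the failure mode that surface-form matching cannot avoid.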

Triangulation with Syntactic Matching
In the previous section, we explained the standard triangulation method and mentioned that pivot-side ambiguity causes incorrect estimation of translation probabilities, which may decrease translation accuracy. To address this problem, it is desirable to be able to distinguish pivot-side phrases that have different syntactic roles or meanings, even if the symbol strings are exactly equivalent. In the following two sections, we describe two methods to distinguish pivot phrases that have syntactically different roles, one based on exact matching of parse trees, and one based on soft matching.

Exact Matching of Parse Subtrees
In the exact matching method, we first train pivot-source and pivot-target T2S TMs by parsing the pivot side of parallel corpora, and store them into rule tables as T_PS and T_PT respectively. Synchronous rules of T_PS and T_PT take the form of X → ⟨p̂, s⟩ and X → ⟨p̂, t⟩ respectively, where p̂ is a symbol string that expresses a pivot-side parse subtree (S-expression), and s and t express source and target symbol strings. The procedure of synthesizing source-target synchronous rules essentially follows equations (11)-(14), except that it uses T_PS instead of T_SP (the direction of the probability features is reversed) and the pivot subtree p̂ instead of the pivot phrase p. Here s and t do not have syntactic information; therefore the synthesized synchronous rules are hierarchical rules as explained in §2.2.
The matching condition of this method is stricter than the matching of superficial symbols in standard triangulation, and has the potential to reduce incorrect connections of phrase pairs, resulting in a more reliable triangulated TM. On the other hand, the number of connected rules also decreases in this restricted triangulation, and the coverage of the triangulated model might be reduced. Therefore it is important to create TMs that are both reliable and have high coverage.

Partial Matching of Parse Subtrees
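To prevent the reduction of coverage in the exact matching method, we also propose a partial matching method that keeps the same coverage as standard triangulation by allowing connection of incompletely equivalent pivot subtrees. To estimate translation probabilities in partial matching, we first define weighted triangulation, generalizing equations (11)-(14) of standard triangulation with a weight function ψ(·):

\begin{align}
\phi(t|s) &= \sum_{\hat{p}_T} \sum_{\hat{p}_S} \phi(t|\hat{p}_T)\,\psi(\hat{p}_T|\hat{p}_S)\,\phi(\hat{p}_S|s) \tag{17} \\
\phi(s|t) &= \sum_{\hat{p}_T} \sum_{\hat{p}_S} \phi(s|\hat{p}_S)\,\psi(\hat{p}_S|\hat{p}_T)\,\phi(\hat{p}_T|t) \tag{18} \\
\phi_{lex}(t|s) &= \sum_{\hat{p}_T} \sum_{\hat{p}_S} \phi_{lex}(t|\hat{p}_T)\,\psi(\hat{p}_T|\hat{p}_S)\,\phi_{lex}(\hat{p}_S|s) \tag{19} \\
\phi_{lex}(s|t) &= \sum_{\hat{p}_T} \sum_{\hat{p}_S} \phi_{lex}(s|\hat{p}_S)\,\psi(\hat{p}_S|\hat{p}_T)\,\phi_{lex}(\hat{p}_T|t) \tag{20}
\end{align}

where p̂_S ∈ T_SP and p̂_T ∈ T_PT are pivot parse subtrees of source-pivot and pivot-target synchronous rules respectively. By adjusting ψ(·), we can control the magnitude of the penalty for incompletely matched connections. If we define ψ(p̂_T|p̂_S) = 1 when p̂_T is equal to p̂_S and ψ(p̂_T|p̂_S) = 0 otherwise, equations (17)-(20) are equivalent to equations (11)-(14).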
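Weighted triangulation can be sketched as a double loop over pivot subtrees with a pluggable weight function. The sketch below simplifies several things: tables are plain dictionaries holding one probability per rule, the feature direction is kept as in standard triangulation, and the names `weighted_triangulate`, `exact_match` and `soft_match` as well as the S-expressions and the 0.1 penalty are hypothetical choices made only for illustration.

```python
from collections import defaultdict


def weighted_triangulate(src_pvt, pvt_tgt, psi):
    """phi(t|s) = sum over p_T, p_S of phi(t|p_T) * psi(p_T|p_S) * phi(p_S|s).

    src_pvt maps (source, pivot_subtree) -> phi(pivot_subtree | source)
    pvt_tgt maps (pivot_subtree, target) -> phi(target | pivot_subtree)
    psi(p_t, p_s) returns the matching weight between two pivot subtrees.
    """
    src_tgt = defaultdict(float)
    for (source, p_s), p_given_s in src_pvt.items():
        for (p_t, target), t_given_p in pvt_tgt.items():
            w = psi(p_t, p_s)
            if w > 0.0:  # zero-weight connections produce no rule
                src_tgt[(source, target)] += t_given_p * w * p_given_s
    return dict(src_tgt)


def exact_match(p_t, p_s):
    """Indicator weight: recovers exact matching of parse subtrees."""
    return 1.0 if p_t == p_s else 0.0


def soft_match(p_t, p_s):
    """Illustrative soft weight with an arbitrary 0.1 penalty for mismatches."""
    return 1.0 if p_t == p_s else 0.1


# Hypothetical pivot-side subtrees for the verb and noun readings of "record".
VP = "(VP [X1] (VB record) [X2])"
NP = "(NP [X1] (NN record) [X2])"
src_pvt = {("[X1] enregistrer [X2]", VP): 1.0, ("[X1] dossier [X2]", NP): 1.0}
pvt_tgt = {(VP, "[X1] 记录 [X2]"): 1.0, (NP, "[X2] [X1] 记录"): 1.0}

exact_table = weighted_triangulate(src_pvt, pvt_tgt, exact_match)
soft_table = weighted_triangulate(src_pvt, pvt_tgt, soft_match)
```

With the indicator weight only the two syntactically consistent pairs survive, while the soft weight keeps all four pairs but scores the mismatched ones lower, which mirrors the trade-off between reliability and coverage discussed in the text.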
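Better estimating ψ(·) is not trivial, and co-occurrence counts of p̂_S and p̂_T are not available. Therefore we introduce a heuristic estimation method in which ψ(·) is computed in equations (21) and (22) from a weight w(p̂_S, p̂_T), with w(·) and the distance d(·) defined as follows:

\begin{align}
w(\hat{p}_S, \hat{p}_T) &=
\begin{cases}
0 & \text{if } flat(\hat{p}_S) \neq flat(\hat{p}_T) \\
\exp(-d(\hat{p}_S, \hat{p}_T)) & \text{otherwise}
\end{cases} \tag{23} \\
d(\hat{p}_S, \hat{p}_T) &= TreeEditDistance(\hat{p}_S, \hat{p}_T) \tag{24}
\end{align}

where flat(p̂) returns the symbol string of p̂ keeping non-terminals, and TreeEditDistance(p̂_S, p̂_T) is the minimum cost of a sequence of operations (contract an edge, uncontract an edge, modify the label of an edge) needed to transform p̂_S into p̂_T (Klein, 1998).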
According to equations (21)-(24), we can ensure that an incomplete match of pivot subtrees leads to d(·) ≥ 1 and is penalized such that ψ(·) ≤ 1/e^d ≤ 1/e, while an exact match of subtrees leads to a value of ψ(·) at least e ≈ 2.718 times larger than when using partially matched subtrees.
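The penalty behaviour described above can be illustrated with a short Python sketch. Several parts are assumptions rather than the paper's definitions: the weight is taken to be exp(-d) when the flattened symbol strings agree and 0 otherwise, ψ is obtained by normalizing that weight over the candidate subtrees, the S-expression format is hypothetical, and `label_mismatches` is a toy stand-in for a real tree edit distance that is only meaningful for trees of identical shape.

```python
import math


def tokenize(sexpr):
    """Split an S-expression into parentheses, labels and leaves."""
    return sexpr.replace("(", " ( ").replace(")", " ) ").split()


def flat(sexpr):
    """Return the leaf symbols of an S-expression, keeping non-terminals."""
    leaves, prev = [], None
    for tok in tokenize(sexpr):
        # A token right after "(" is a node label, not a leaf.
        if tok not in ("(", ")") and prev != "(":
            leaves.append(tok)
        prev = tok
    return " ".join(leaves)


def weight(p_s, p_t, distance):
    """Assumed weight: exp(-d) if the flattened strings agree, else 0."""
    if flat(p_s) != flat(p_t):
        return 0.0
    return math.exp(-distance(p_s, p_t))


def psi(p_t, p_s, candidates, distance):
    """Assumed normalization of the weight over candidate pivot subtrees."""
    total = sum(weight(p_s, p, distance) for p in candidates)
    return weight(p_s, p_t, distance) / total if total > 0.0 else 0.0


def label_mismatches(a, b):
    """Toy stand-in for TreeEditDistance: counts differing tokens.

    NOT a real tree edit distance; valid only for equally shaped trees.
    """
    return sum(x != y for x, y in zip(tokenize(a), tokenize(b)))


VP = "(VP [X1] (VB record) [X2])"
NP = "(NP [X1] (NN record) [X2])"
OTHER = "(VP (VB keep) [X1])"
candidates = [VP, NP, OTHER]

psi_exact = psi(VP, VP, candidates, label_mismatches)
psi_partial = psi(NP, VP, candidates, label_mismatches)
ratio = psi_exact / psi_partial
```

In this toy case the two subtrees differ in two labels, so the exact match is weighted e^2 times higher than the partial match, consistent with the lower bound of e stated in the text. A real system would replace `label_mismatches` with a proper tree edit distance implementation.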

Experimental Set-Up
To investigate the effect of our proposed approach, we evaluate the translation accuracy through pivot translation experiments on the United Nations Parallel Corpus (UN6Way) (Ziemski et al., 2016). UN6Way is a line-aligned multilingual parallel corpus that includes data in English (En), Arabic (Ar), Spanish (Es), French (Fr), Russian (Ru) and Chinese (Zh), covering different families of languages. It contains more than 11M sentences for each language pair, and is therefore suitable for multilingual translation tasks such as pivot translation. In these experiments, we fixed English as the pivot language, considering that it is the language most frequently used as a pivot language. This has the positive side-effect that accurate phrase structure parsers are available in the pivot language, which benefits our proposed method. We performed pivot translation on all the combinations of the other 5 languages, and compared the accuracy of each method. For tokenization, we adopt SentencePiece, an unsupervised text tokenizer and detokenizer. Although it is designed mainly for neural MT, we confirmed that it also helps to reduce training time and even improves translation accuracy in our Hiero model. We first trained a single shared tokenization model by feeding a total of 10M sentences from the data of all 6 languages, set the maximum shared vocabulary size to 16k, and tokenized all available text with the trained model. We used English raw text without tokenization for phrase structure analysis and for training Hiero and T2S TMs on the pivot side. To generate parse trees, we used the Ckylark PCFG-LA parser (Oda et al., 2015), and filtered out lines of length over 60 tokens from all the parallel data to ensure accuracy of parsing and alignment. About 7.6M lines remained. Since Hiero requires a large amount of computational resources for training and decoding, we decided not to use all available training data but only the first 1M lines for training each TM. As a decoder, we use Travatar (Neubig, 2013), and train Hiero and T2S TMs with its rule extraction code. We train 5-gram LMs over the target side of the same parallel data used for training TMs using KenLM (Heafield, 2011). For testing and parameter tuning, we used the first 1,000 lines of the 4,000-line test and dev sets, respectively. For the evaluation of translation results, we first detokenized with the SentencePiece model, re-tokenized Arabic, Spanish, French and Russian text with the tokenizer of the Moses toolkit (Koehn et al., 2007) and Chinese text with the KyTea tokenizer (Neubig et al., 2011), and then evaluated using case-sensitive BLEU-4 (Papineni et al., 2002).
We evaluate 6 translation methods:

Direct:
Translating with a Hiero TM directly trained on the source-target parallel corpus without using a pivot language (as an oracle).

Tri. TreeExact:
Triangulating pivot-source and pivot-target T2S TMs into a source-target Hiero TM using the proposed exact matching of pivot subtrees (proposed 1, §4.1).

Tri. TreePartial:
Triangulating pivot-source and pivot-target T2S TMs into a source-target Hiero TM using the proposed partial matching of pivot subtrees (proposed 2, §4.2).

Experimental Results
The results of experiments using all combinations of pivot translation tasks for 5 languages via English are shown in Table 1. From the results, we can see that the proposed partial matching of pivot subtrees in triangulation outperforms the standard triangulation method for all language pairs and achieves scores higher than or almost equal to those of the proposed exact matching method. The exact matching method also outperforms the standard triangulation method in the majority of the language pairs, but yields a smaller improvement than the partial matching method. In Table 2 we compare the coverage of each proposed triangulation method. From this table, we can see that the exact matching method reduces the number of unique phrases by several percent, while the partial matching method keeps the same coverage as surface-form matching. We consider this to be one of the reasons for the difference in the stability of improvement between the partial and exact matching methods.
We show an example of a translated sentence for which pivot-side ambiguity is resolved in the syntactic matching methods:

Source Sentence in French:
La Suisse encourage tous les États parties

Corresponding Sentence in English:
Switzerland encourages all parties to support the current conceptual work of the secretariat.

Comparison with Neural MT:
Recent results (Firat et al., 2016; Johnson et al., 2016) have found that neural machine translation systems can gain the ability to perform translation with zero parallel resources by training on multiple sets of bilingual data. However, previous work has not examined the competitiveness of these methods with pivot-based symbolic SMT frameworks such as PBMT or Hiero. In this section, we compare a zero-shot NMT model (detailed parameters in Table 3) with our pivot-based Hiero models. Direct NMT is trained with the same data as Direct Hiero, Cascade NMT translates by bridging source-pivot and pivot-target NMT models, and Zero-Shot NMT is trained as a single shared model on pvt ↔ {src, target} parallel data according to Johnson et al. (2016). To train and evaluate NMT models, we adopt NMTKit. From the results, we see that directly trained NMT models achieve high translation accuracy even for translation between languages of different families, whereas accuracy drops drastically when no source-target parallel corpus is available for training. Cascade translation is a straightforward way of connecting two TMs, and NMT cascade translation shows intermediate performance in this experiment. In our setting, while bilingually trained NMT systems were competitive with or outperformed Hiero-based models, zero-shot translation is uniformly weaker. This may be because we used only 1 LSTM layer for the encoder and decoder, or because the amount of parallel data or the number of language pairs was not sufficient. Thus, we can posit that while zero-shot translation has demonstrated reasonable results in some settings, successful zero-shot translation systems are far from trivial to build, and pivot-based symbolic MT systems such as PBMT or Hiero may still be a competitive alternative.

Conclusion
In this paper, we have proposed a method of pivot translation using triangulation with exact or partial matching of pivot-side parse subtrees.
In experiments, we found that these triangulated models are effective, in particular when allowing partial matching. To estimate translation probabilities, we introduced a heuristic that is not guaranteed to be optimal. Therefore, in the future we plan to explore more refined estimation methods that utilize machine learning.

Table 1: Comparison of each triangulation method. Bold face indicates the highest BLEU score in pivot translation, and daggers indicate statistically significant gains over Tri. Hiero (†: p < 0.05, ‡: p < 0.01).

Table 2: Comparison of rule table coverage in proposed triangulation methods.

Table 3: Main parameters of NMT training.