Cross-language Projection of Dependency Trees with Constrained Partial Parsing for Tree-to-Tree Machine Translation

Tree-to-tree machine translation (MT) that utilizes syntactic parse trees on both the source and target sides suffers from the non-isomorphism of the parse trees, due to parsing errors and the difference in annotation criteria between the two languages. In this paper, we present a method that projects dependency parse trees from the language side that has a high quality parser to the side that has a low quality parser, to improve the isomorphism of the parse trees. We first project a part of the dependencies with high confidence to make a partial parse tree, and then complement the remaining dependencies with partial parsing constrained by the already projected dependencies. MT experiments verify the effectiveness of our proposed method.


Introduction
According to how syntactic parse trees are used in machine translation (MT), there are four types of MT approaches: string-to-string, which does not use parse trees (Chiang, 2005; Koehn et al., 2007); string-to-tree, which uses parse trees on the target side (Galley et al., 2006; Shen et al., 2008); tree-to-string, which uses parse trees on the source side (Quirk et al., 2005; Liu et al., 2006; Mi and Huang, 2008); and tree-to-tree, which uses parse trees on both sides (Zhang et al., 2008; Richardson et al., 2015). Intuitively, the tree-to-tree approach seems to be the most appropriate, because it can preserve the structural information on both sides, which leads to fluent and accurate translations.
In practice, however, good quality parsers on both the source and target sides are difficult to acquire. In many cases, the parsing quality of one side is much higher than that of the other side, because the higher quality side has a well annotated treebank or is linguistically easier to parse. For example, in the case of Japanese-Chinese MT that we study in this paper, the head-final characteristic of Japanese (Isozaki et al., 2010) makes dependency parsing much easier for Japanese than for Chinese. Currently, the dependency parsing accuracy for Japanese is over 90% (Kawahara and Kurohashi, 2006), while the Chinese parsing accuracy is less than 80% (Shen et al., 2012). Another problem is the difference in the annotation criteria of the treebanks in different languages, which are used for training the parsers. For example, the dependency annotations of noun phrases and coordination can differ among languages: in Japanese, noun phrases and coordination are annotated as modifier-head dependencies (Kawahara and Kurohashi, 2006), while in Chinese they are annotated as sibling dependencies (Shen et al., 2012). These two problems lead to differences between the source and target parse trees, which affect the translation rule extraction in tree-to-tree MT that requires the isomorphism of the parse trees. This severely limits the translation quality of tree-to-tree MT.
In this paper, we present an approach that projects dependency trees from a high quality (HQ) parser to a low quality (LQ) parser using alignment information. The projection can reduce the parsing errors on the LQ side, and address the annotation criterion difference problem. This makes the LQ trees isomorphic to the HQ trees, which benefits the translation rule extraction in tree-to-tree MT, and thus improves the MT performance. The idea of cross-language projection of parse trees has been proposed previously, e.g., (Ganchev et al., 2009; Jiang et al., 2010; Shen et al., 2015). However, few studies have been conducted in the context of dependency based tree-to-tree MT, which is the setting of this paper. In addition, we propose a novel constrained partial parsing method to address word alignment problems such as unaligned words and alignment errors in projection. Specifically, we first apply a partial projection step to project a part of the dependencies with high confidence, judged by the alignment information and a projectivity criterion. We thus obtain a projected "partial tree." We then find the missing dependencies in this partial tree by applying a "partial parsing" method: we apply a parser to find the missing dependencies subject to respecting the projected dependencies, so that we obtain a full dependency tree. Initially, the LQ parser is used for the partial parsing process. Once the entire projection process has finished, we select a part of the projected trees based on the dependency projection ratio of the partial projection step, and re-train a parser for the LQ side. This re-trained parser tends to be more isomorphic to the HQ parser, and thus we apply it again for the partial parsing process.
We conduct experiments with the open source dependency based tree-to-tree MT system KyotoEBMT (Richardson et al., 2015) on the Japanese-Chinese language pair. Because of the improvement of the isomorphism of the source and target parse trees by our proposed method, we achieve significant MT performance improvements in both the Japanese-to-Chinese and Chinese-to-Japanese translation directions.
2 The Difficulties of Tree-to-Tree MT

Overview of the KyotoEBMT System
This study is conducted on the KyotoEBMT system (Richardson et al., 2015; http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?KyotoEBMT), which is a representative dependency based tree-to-tree MT system. Figure 1 shows an overview of the KyotoEBMT system for Chinese-to-Japanese translation. The translation example database is automatically constructed from a parallel training corpus by means of a discriminative alignment model (Riesa et al., 2011). It contains "examples" that form the hypotheses to be combined during decoding. Note that both the source and target sides of all the examples are stored as dependency trees. An input sentence is also parsed and transformed into a dependency tree. For all the subtrees in the input dependency tree, matching hypotheses are searched for in the example database. This step is the most time consuming part, and a fast subtree retrieval method (Cromieres and Kurohashi, 2011) is used. There are many available hypotheses for one subtree, and also many possible hypothesis combinations. The best combination is detected by a lattice-based decoder, which optimizes a log-linear model (Cromieres and Kurohashi, 2014). In the example in Figure 1, four hypotheses are used. They are combined to produce an output dependency tree, which is the final translation. For more details of the system, please refer to (Richardson et al., 2015).

The Translation Example Extraction Problem
One advantage of the KyotoEBMT system is that it can handle examples that are discontinuous as a word sequence but continuous structurally, because of the usage of both source and target parse trees. In Figure 2, for example, the translation examples "26-31: /4: 14:类 (show the similarity)" and "0-2: 30-35: /0-4:认为这 现 (I think that this phenomenon shows)" can be extracted by the KyotoEBMT system, because they are continuous in the parse trees. However, in phrase based MT (Koehn et al., 2007), neither of these two translation examples can be extracted. The reason for this is that "4: (show)" and "14:类 (similarity)" are discontinuous in the Chinese sentence; similarly, "0-2: (this phenomenon)" and "30-35: (I think that shows)" are discontinuous in the Japanese sentence.
On the other hand, it also adds the constraint that a translation example has to share the same structure on the parse trees, to guarantee the quality of the extracted examples. One reason why an example cannot be extracted is parsing errors: for example, in Figure 2, the translation example "... (only include K+)" could not be extracted. The other reason is the annotation criterion difference. In Figure 2, for example, the translation example of "18: 19: /21:标 22:试样(standard sample)" could not be extracted, though both of the parses are correct. In Japanese this kind of noun phrase structure is annotated as modifier-head, while in Chinese it is annotated as siblings depending on the last word.
One possible solution to address the above problem is to loosen the constraint for translation example extraction. For example, to extract the "18: 19: /21:标 22:试样(standard sample)" example, whose failure is caused by the annotation criterion difference, we might allow the extraction of examples that are modifier-head and sibling subtrees on the source and target sides, respectively.

Figure 3: An overview of our constrained partial parsing based projection method.

However,
firstly, even loosening to this degree could lead to other noisy translation examples; secondly, what kind of loosening is required for the parse error case is unclear, because the types of parse errors are diverse. Therefore, instead of loosening the constraint, we choose the cross-lingual projection approach to address the problem. Figure 3 is an overview of our proposed constrained partial parsing method. Firstly, we apply a partial projection process to project a part of the dependencies from the HQ tree using the HQ tree, word alignment information, and a projectivity criterion. Note that the word alignment information is omitted in Figure 3 for simplification. In Figure 3, the circled part in the HQ tree is projected. Next, we apply partial parsing to complement the other dependencies in the partially projected tree using the LQ parser. In Figure 3, as the LQ parser can parse the circled part in the original LQ tree correctly, it also complements the dependencies for the partially projected tree correctly. Once we obtain the projected trees, we select a part of the highly confident projected trees as training data to re-train the LQ parser. Finally, we apply the re-trained LQ parser for the partial parsing process, which further improves the quality of the projection.

Projection of Dependency Trees with Constrained Partial Parsing
In the remainder of this section, we describe the details of partial projection, partial parsing, and re-training of the LQ parser in Sections 3.1, 3.2, and 3.3, respectively.

Direct Mapping for Dependency Tree Projection
We first present a direct mapping method for dependency tree projection using word alignment, which can be formalized as follows. Given a parallel sentence pair (S, T), where S = s_1...s_i...s_n and T = t_1...t_j...t_m are sentences of the HQ and LQ sides, respectively, s_i and t_j denote the word indices (which also denote the node indices in the dependency trees) in the corresponding sentences. We have a dependency tree for S denoted as Tree_S = {(s_i, s_k), ...} that is composed of a set of dependencies, where (s_i, s_k) means that the word s_i is dependent on the word s_k. We also have a word alignment between S and T. The new LQ parse tree Tree_T^new is projected from Tree_S. We first perform the following preprocessing for the unaligned HQ words.
• unaligned words (HQ side): If s_i is an unaligned word, we link the dependencies around s_i. More specifically, if s_i is unaligned, (s_h, s_i) ∈ Tree_S, and (s_i, s_k) ∈ Tree_S, we add (s_h, s_k) to Tree_S, and discard (s_h, s_i) and (s_i, s_k) from Tree_S. This preprocessing makes two distinct words separated by unaligned words form a modifier-head pair. For example, in Figure 2, because "32: (thing)" is an unaligned word, we add (30: (show), 33: (and)) to Tree_S.
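The preprocessing above can be sketched as follows, representing the HQ tree as a {modifier: head} map and the alignment as the set of aligned HQ word indices. The names `collapse_unaligned`, `heads_s`, and `aligned` are our own illustration, not from the original implementation:

```python
def collapse_unaligned(heads_s, aligned, root=0):
    """Link dependencies around unaligned HQ words.

    heads_s: HQ tree as a {modifier: head} map (head `root` = dummy root).
    aligned: set of HQ word indices that have at least one alignment link.
    If s_i is unaligned and (s_h, s_i), (s_i, s_k) are in the tree, the
    result contains (s_h, s_k) instead, as described above.
    """
    collapsed = {}
    for m, h in heads_s.items():
        if m not in aligned:
            continue            # arcs of unaligned modifiers are discarded
        while h != root and h not in aligned:
            h = heads_s[h]      # skip over chains of unaligned heads
        collapsed[m] = h
    return collapsed
```

For instance, with word 2 unaligned, the arcs 1→2 and 2→3 collapse into a single modifier-head pair 1→3.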
We then process each source node s_i in Tree_S in a top-down manner (from the root node to the leaf nodes), applying the following rules according to the alignment type.
• one to one alignment: If s_i aligns to a unique t_j, s_k aligns to a unique t_l, and (s_i, s_k) ∈ Tree_S, add (t_j, t_l) to Tree_T^new. For example, in Figure 2, the Japanese dependency (0: (this), 1: (phenomenon)) is projected to the Chinese side as (1:这, 2: (this)) by applying this rule.
• many to one alignment: If (s_i, s_k, ...) aligns to t_j, we take the head s_r (e.g., s_k) from (s_i, s_k, ...) as the representative, and then perform the same process as in the one to one alignment case. For example, in Figure 2, (33: 34: 35: (think), 0:认为(think)) is a many to one alignment, and we select the head "34: " as the representative.
• one to many alignment: If s_i aligns to several words (t_j, t_l, ...), then, similarly to the many to one alignment case, we take the head t_r (e.g., t_j) of (t_j, t_l, ...) based on the original LQ tree as the representative, and then perform the same process as in the one to one alignment case for s_i and t_r.
• many to many alignment: Reduce this to the one-to-many and many-to-one cases, i.e., select the representatives for both sides, and then perform the same process as in the one to one alignment case.
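A minimal sketch of the direct mapping rules above might look as follows. It handles the one to one and one to many cases explicitly, and lets many to one groups collapse onto their shared target word; the unaligned-word preprocessing and the projectivity check are omitted here, and all data structures and names are hypothetical:

```python
def project_dependencies(heads_s, align, heads_t_orig):
    """Direct mapping of HQ dependencies (s_i, s_k) onto the LQ side.

    heads_s:      HQ tree as a {modifier: head} map (head 0 = dummy root).
    align:        {HQ word: [aligned LQ words]}.
    heads_t_orig: original LQ tree, used only to pick the head
                  representative t_r in one to many alignments.
    Returns the projected partial LQ tree as a {modifier: head} map.
    """
    def rep(s):
        ts = align.get(s)
        if not ts:
            return None            # unaligned HQ word: no clue, skip
        if len(ts) == 1:
            return ts[0]           # one to one
        group = set(ts)            # one to many: take the LQ word whose
        for t in ts:               # own head (in the original LQ tree)
            if heads_t_orig.get(t) not in group:   # lies outside the group
                return t
        return ts[0]

    projected = {}
    for s_i, s_k in heads_s.items():
        t_j, t_l = rep(s_i), rep(s_k)
        if t_j is not None and t_l is not None and t_j != t_l:
            projected[t_j] = t_l   # many to one groups collapse to t_j == t_l
    return projected
```

Dependencies whose endpoints map to the same target word (the many to one case) are simply skipped, since the group's external dependency carries the projected arc.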

Partial Projection with Direct Mapping
There are several cases that the direct mapping method cannot deal with:

1. The other nodes in the one to many alignment case: For the nodes (e.g., t_l) in (t_j, t_l, ...) that align to one word s_i, other than the representative t_r, there are no clues to determine their dependencies during the projection.
2. Unaligned words (LQ side): If t_j is an unaligned word, there are also no clues for the projection. For example, in Figure 2, because the Chinese words "3:现 (phenomenon)", "15: (and)" and "20: ('s)" are unaligned, we cannot determine their dependencies by projection.
3. Alignment errors: Because the direct mapping method highly depends on word alignments, erroneous word alignments lead to wrong projected dependencies. For example, in Figure 2, the Japanese word "12: (preferably)" is incorrectly aligned to the Chinese word "13: (extremely)"; this erroneous alignment would project the Japanese dependency (12: (preferably), 14:+) to the Chinese side, leading to a projected dependency of (13: (extremely), 19:+), which is obviously incorrect. Alignment errors can happen due to many factors, one of which is translation shift; the erroneous alignment in Figure 2 is such a case.
Because of the existence of the above cases, we only apply the direct mapping method for partial projection. For cases (1) and (2), we leave the dependencies for these words as null. For case (3), we propose a projectivity criterion to detect the alignment errors, and again leave the dependencies as null. Note that all three of these cases are processed during the top-down projection process.

Adding a Projectivity Criterion to the Projection Process
Projectivity is a property of dependency parsing, which informally means that there should be no crossing arcs in a dependency tree (Kübler et al., 2009). For example, Tree_T^new = {(0, 2), (1, 3), (2, 3), (3, −1)} (−1 denotes the root) is not projective, because the arc of the modifier-head pair (0, 2) and that of the modifier-head pair (1, 3) cross. We use the projectivity property to detect alignment errors during the top-down projection process. Suppose that by processing the HQ tree from the root, we already have a partially projected LQ subtree. Next, we want to project a new dependency in the HQ tree to the LQ side. If adding this newly projected dependency to the partially projected subtree leads to non-projectivity, we give up this projection and leave the dependency as null.
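The projectivity criterion during projection can be sketched as a crossing test between the newly projected arc and the arcs already in the partially projected subtree. This is a hypothetical sketch; the root attachment, e.g. (3, −1), would be excluded from the test:

```python
def crosses(arc1, arc2):
    # two dependency arcs cross iff their word-index spans strictly interleave
    (a1, b1), (a2, b2) = sorted(arc1), sorted(arc2)
    return a1 < a2 < b1 < b2 or a2 < a1 < b2 < b1

def can_project(partial_arcs, new_arc):
    """Return True if adding new_arc keeps the partially projected subtree
    projective; otherwise the projection is given up and the dependency is
    left as null (root arcs such as (3, -1) are assumed to be excluded)."""
    return not any(crosses(new_arc, arc) for arc in partial_arcs)
```

On the example above, the check rejects projecting (1, 3) once (0, 2) is already in the partial subtree.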

Partial Parsing
After the partial projection step, we obtain partially projected trees containing the null dependencies discussed in Section 3.1.2. We then perform partial parsing to complement these null dependencies. Before describing the partial parsing method, we first review the formalism of dependency parsing used in many previous studies such as (Kübler et al., 2009; Shen et al., 2012):

Y* = argmax_{Y ∈ Φ(X)} score(Y, X) (1)

where X = x_1...x_i...x_n is the input sentence, Y is a candidate tree, and Φ(X) is the set of all possible dependency trees over X. Y can be denoted as a set of dependencies {(x_m, x_h), ...}, where (x_m, x_h) is a dependency from the modifier x_m to the head x_h. The problem of dependency parsing is to search for the best tree in Φ(X) that maximizes the score function score(Y, X). The score function can be factorized as the summation of the scores of its factors (subtrees):

score(Y, X) = Σ_{F ∈ Y} score(F, X) (2)

The score function for each factor is denoted as the inner product of a feature vector and a weight vector:

score(F, X) = w · f(F, X) (3)

The weight vector can be learnt by, e.g., the averaged structured perceptron algorithm (Collins, 2002) on an annotated treebank. During parsing, the parser utilizes the learnt weight vector to determine the best parse tree. In our partial parsing method, we aim to keep the dependencies in the partially projected trees, while complementing the null dependencies to construct a projective tree. To realize this, we set extremely high scores for the projected dependencies to maximize score(F, X) for them, while setting relatively small scores for the null dependencies. Doing so, the parser searches for the best tree that respects the partially projected dependencies. In our experiments, we used the projective second order graph based dependency parser of Shen et al. (2012). We set the initial dependency scores for the projected dependencies to 1e12, and those for the null dependencies to 0.
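The score-based constraint can be illustrated with a toy first-order parser that exhaustively enumerates projective trees over a short sentence and assigns a huge score (standing in for the 1e12 above) to the projected arcs; a real implementation would instead use the second order graph based parser's dynamic program. All names here are hypothetical:

```python
import itertools

HUGE = 1e12   # the score given to projected dependencies

def is_tree(heads):
    # heads[m-1] is the head of word m; every word must reach root 0 acyclically
    for m in range(1, len(heads) + 1):
        seen, h = set(), m
        while h != 0:
            if h in seen:
                return False
            seen.add(h)
            h = heads[h - 1]
    return True

def is_projective(heads):
    arcs = [tuple(sorted((h, m))) for m, h in enumerate(heads, 1)]
    return not any(a1 < a2 < b1 < b2 or a2 < a1 < b2 < b1
                   for (a1, b1), (a2, b2) in itertools.combinations(arcs, 2))

def constrained_parse(n, score, projected):
    """Best projective tree over n words under first-order arc scores,
    forced to respect the projected {modifier: head} dependencies."""
    def arc_score(h, m):
        if m in projected:           # projected arc: huge score; any other
            return HUGE if projected[m] == h else -HUGE   # head is ruled out
        return score.get((h, m), 0.0)                     # null dependency

    best, best_score = None, float("-inf")
    for heads in itertools.product(range(n + 1), repeat=n):
        heads = list(heads)
        if any(h == m for m, h in enumerate(heads, 1)):
            continue                 # no self-loops
        if not (is_tree(heads) and is_projective(heads)):
            continue
        total = sum(arc_score(h, m) for m, h in enumerate(heads, 1))
        if total > best_score:
            best, best_score = heads, total
    return best
```

For a three-word sentence whose scores favor the head-final tree [3, 3, 0], fixing the projected arc 1→2 forces the search to the best projective tree containing that arc, [2, 3, 0].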

Re-train a New Low Quality Side Parser
Re-training a new LQ parser on the projected trees is necessary for two reasons. Firstly, we initially use the original LQ parser for the partial parsing process, because we do not have a better choice; due to the low accuracy and the annotation criterion difference problem of the LQ parser, there is a risk that it produces unsatisfying parsing results, especially for trees with a low ratio of projected dependencies. Secondly, if we perform LQ-to-HQ direction MT, we should make the parsed trees of the input sentences isomorphic to the projected trees. Re-training a new LQ parser on the projected trees addresses both of these problems. As the re-trained parser tends to be more isomorphic to the HQ parser, it is more effective for the partial parsing process, and can be applied to parse the input sentences for the LQ-to-HQ direction MT task. Therefore, after the entire projection process, we select a part of the projected trees, and re-train a parser for the LQ side. How to select the projected trees for training the new LQ parser is an open question; the main question is how to balance the quality and quantity of the projected trees. Currently, the selection criterion is empirical, based on the ratio of dependencies projected by the partial projection process in a tree, defined by

ratio = #projected dependencies / #all dependencies (4)

The motivation behind this is that the more dependencies are projected by the partial projection in a tree, the more isomorphic the projected tree is to the HQ tree, and the less effect is introduced by the original LQ parser during the partial parsing process. We set a threshold, and use the trees with a ratio higher than the threshold for training the parser. We tried several thresholds in our preliminary experiments, and selected the best threshold of 0.78 (170k trees) based on the MT performance.
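The selection by Equation (4) amounts to a simple filter over the projected trees. This is a sketch with hypothetical data structures, where each tree records its partially projected arcs and its word count (one head per word, so #all dependencies equals the number of words):

```python
def projection_ratio(tree):
    # Equation (4): #projected dependencies / #all dependencies
    return len(tree["projected"]) / tree["n_words"]

def select_training_trees(trees, threshold=0.78):
    """Keep only trees whose partial-projection ratio reaches the
    threshold; these are used to re-train the LQ-side parser."""
    return [t for t in trees if projection_ratio(t) >= threshold]
```

With the threshold of 0.78 chosen in the preliminary experiments, a tree with all of its dependencies projected is kept, while one with only a quarter projected is discarded.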

Experiments
We conducted Japanese-Chinese MT experiments to verify the effectiveness of our constrained partial parsing based projection method.

Settings
We conducted experiments on the scientific domain MT task on the Japanese-Chinese paper excerpt corpus (ASPEC-JC), which is one subtask of the Workshop on Asian Translation (WAT) (Nakazawa et al., 2015). The ASPEC-JC task uses 672,315, 2,090, and 2,107 sentences for training, development, and testing, respectively. We used the tree-to-tree MT system KyotoEBMT (Richardson et al., 2015) for all of our MT experiments. For Chinese, we used the Chinese analysis tool KyotoMorph proposed by Shen et al. (2014) for segmentation and part-of-speech (POS) tagging, and the SKP parser (Shen et al., 2012) for parsing. As the baseline Chinese parser, we trained SKP with the Penn Chinese Treebank version 5 (CTB5), containing 18k sentences in the news domain, and an in-house scientific domain treebank of 10k sentences. For Japanese, we used JUMAN (Kurohashi et al., 1994) for morphological analysis, and the KNP parser (Kawahara and Kurohashi, 2006) for parsing. We trained two 5-gram language models for Chinese and Japanese, respectively, on the training data of the ASPEC corpus using the KenLM toolkit with interpolated Kneser-Ney discounting, and used them for all the experiments. In all of our experiments, we used the discriminative alignment model Nile (Riesa et al., 2011) for word alignment; tuning was performed by k-best batch MIRA (Cherry and Foster, 2012) with 10 iterations, and it was re-run for every experiment. Note that, in our task, Japanese is the HQ parser side and Chinese is the LQ parser side, because of the parsing accuracy difference (90% vs. 80%). Therefore, in our experiments, we projected the Japanese parse trees to Chinese. We compared the MT performance of our proposed projection method with the baseline Chinese parser. For Japanese-to-Chinese MT experiments, we compared the MT results of the Chinese training data parsed by the baseline parser with those of the projected trees.
For Chinese-to-Japanese MT, we also re-parsed the development and test Chinese sentences using the SKP model trained on the projected Chinese trees, for comparison. Table 1 shows the results, where KyotoEBMT is the baseline system that used the Chinese parser trained on CTB5; Baseline partial parsing denotes the projection systems that used the Chinese parser trained on CTB5 for the partial parsing process; and Re-trained partial parsing denotes the systems that used the Chinese parser re-trained on the projected trees for the partial parsing process. For reference, we also show the MT performance of the phrase based, string-to-tree, and tree-to-string systems, which are based on the open-source GIZA++/Moses pipeline (Koehn et al., 2007). Note that in all of the Moses, string-to-tree, and tree-to-string settings, Japanese is always in the string format, and Chinese is parsed by the Berkeley parser (Petrov and Klein, 2007; https://github.com/slavpetrov/berkeleyparser). We show the MT performance of the Moses systems that only parsed the Chinese data, because these were the baseline systems of WAT. The significance tests were performed using the bootstrap resampling method (Koehn, 2004).

MT Results
We can see that the Baseline KyotoEBMT system outperforms the Moses, string-to-tree, and tree-to-string systems, which verifies the effectiveness of the tree-to-tree approach. The performance difference of KyotoEBMT against the other three MT approaches in the Ja-to-Zh direction is much larger than that in the Zh-to-Ja direction. The reason for this is that KyotoEBMT is much more sensitive to the parsing accuracy on the source side, because the source tree is utilized in the ordering of the final translation. Therefore, using Chinese as the source side limits the effectiveness
of the KyotoEBMT system. Baseline partial parsing performs significantly better than the Baseline KyotoEBMT, and Re-trained partial parsing further improves the performance significantly. We also observe slightly more improvement in the Zh-to-Ja direction than in the Ja-to-Zh direction. The reason is similar to the one above: in the Zh-to-Ja task, we improve not only the translation example extraction, but also the quality of the input trees.
To further understand the reason for the MT improvement, we investigated the number of hypotheses for the test sentences. The number of hypotheses for a test sentence is the number of all matching hypotheses in the example database for all the subtrees in the input dependency structure of the test sentence (refer to Section 2.1). The total numbers of hypotheses for all the test sentences of the different systems are shown in Table 2. We can see that the numbers of hypotheses for the partial parsing systems are much larger than that of the baseline KyotoEBMT system. The reason for this is that our projection method significantly increased the isomorphism of the source and target trees in the training corpus, making more translation examples extractable. More hypotheses can potentially improve the final MT performance.
In addition, we investigated the translation results of the Baseline KyotoEBMT and Re-trained partial parsing systems. We found that there are three reasons that lead to the improvement. We explain these reasons through an improved example of Zh-to-Ja translation shown in Figure 4. The first reason is the improvement of the input parse tree. There is a crucial parsing error in the input tree of the Baseline KyotoEBMT system. KyotoMorph incorrectly assigned the wrong POS tag "VV (verb)" to the word "15: (inhibition)", which should in fact be "NN (noun)". This leads to this word being treated as the head of the whole following noun phrase. Using this erroneous input parse tree, this word is also translated as the head of the entire noun phrase. Our Re-trained partial parsing correctly parsed the word "15: (inhibition)" as a part of the noun phrase "15-18: 氧 实 验(inhibition of oxygen consumption test)", leading to the correct translation. Although the Re-trained partial parsing could not correct the wrong POS tag of the word, it successfully parsed this sentence, because we also used this kind of data to train the parser. The second reason is the increase of translation hypotheses. The number of hypotheses for the Baseline KyotoEBMT system is 2,447, while the number of hypotheses for the Re-trained partial parsing system is 3,311. The number of hypotheses for "0:针 对...7:进 8: (about...performed)" increased from 52 to 176 in the Re-trained partial parsing system, which improved the translation. The third reason is the isomorphism of the input and output target dependency trees.
Note that the noun phrases "15-18: 氧 实验(inhibition of oxygen consumption test)" and "20-23: 实验(large-scale flea acute toxicity test)" are parsed as siblings in the Baseline KyotoEBMT system, while in our Re-trained partial parsing model they are parsed as modifier-head dependencies, which is isomorphic to the Japanese parse tree. One unsatisfying point is that "21: (flea acute)" is an unknown word, a difficult technical term that could not be translated by either of the two systems.

Related Work
Many previous studies have proposed methods to address the difficulties in projecting parse trees from a resource-rich language (e.g., English) to a low-resource language, in order to improve the parsing accuracy of the low-resource language. The difficulties in projection can mainly be divided into two categories: word alignment errors and annotation criterion differences (Ganchev et al., 2009).
To address the word alignment error problem, several studies have proposed training a target parser on high confidence partially projected trees. Ganchev et al. (2009) presented a partial projection method with constraints such as language-specific annotation rules. They then trained a target parser using the partially projected trees. Spreyer and Kuhn (2009) proposed a similar method that trains both graph-based and transition-based dependency parsers on the partially projected trees. Rasooli and Collins (2015) proposed a method to train a target parser on "dense" projected trees, i.e., partially projected trees in which the ratio of projected dependencies exceeds a threshold. Our proposed method differs from the previous studies in several aspects: we propose the use of the projectivity criterion for partial projection; we utilize the original target parser and propose a constrained partial parsing algorithm; and we re-train a target parser on the full trees generated by the partial parsing.
To address the annotation criterion difference problem in projection, Hwa et al. (2005) first projected the dependency parse trees, and then applied post-projection transformations based on manually created rules. Jiang et al. (2011) presented a method that tolerates the syntactic non-isomorphism between languages, so that the projected parse trees do not have to follow the annotation criteria of the source parse trees. Our proposed method does not adjust the annotation criterion difference between the source and the projected trees, because in our tree-to-tree MT task, we prefer isomorphic trees.
Only a few studies have been conducted to improve MT performance via projection. For string-to-string MT (Koehn et al., 2007), a pre-ordering method has been proposed that projects target side constituency trees to the source side, and then generates pre-ordering rules based on the projected trees. For tree-to-string MT, Jiang et al. (2010) combined projection and supervised constituency parsing by guiding the parsing procedure of the supervised parser with the projected parser. They showed that the guided parser achieved comparable MT results on a tree-to-string system (Liu et al., 2006), compared to a normal supervised parser trained on thousands of CTB trees. For tree-to-tree MT (Richardson et al., 2015), Shen et al. (2015) proposed a naive projection method. They complemented the remaining dependencies for a partially projected tree with a backtracking method; namely, they reused the dependencies in the original target tree for the complementation, without considering the partially projected dependencies. In contrast, in this paper we propose partial parsing for the complementation, in which we search for the best parse tree by taking the partially projected dependencies into account.

Conclusion
In this paper, we proposed a constrained partial parsing method for projection to address the non-isomorphic parse tree problem in a dependency based tree-to-tree MT system. Experiments verified the effectiveness of our proposed method. As future work, firstly, we plan to design a better way of selecting the projected trees for re-training the LQ parser. Secondly, we plan to perform the partial parsing in several iterations. Finally, we plan to conduct experiments on more language pairs to show the language independence of our proposed method.