Cross-Lingual Dependency Parsing Using Code-Mixed TreeBank

Treebank translation is a promising method for cross-lingual transfer of syntactic dependency knowledge. The basic idea is to map dependency arcs from a source treebank to its target translation according to word alignments. This method, however, can suffer from imperfect alignment between source and target words. To address this problem, we investigate syntactic transfer by code mixing, translating only confident words in a source treebank. Cross-lingual word embeddings are leveraged for transferring syntactic knowledge to the target from the resulting code-mixed treebank. Experiments on Universal Dependency Treebanks show that code-mixed treebanks are more effective than translated treebanks, giving highly competitive performance among cross-lingual parsing methods.


Introduction
Treebank translation (Tiedemann et al., 2014; Tiedemann, 2015; Tiedemann and Agić, 2016) has been considered as a method for cross-lingual syntactic transfer. Take dependency grammar as an example. Given a source treebank, machine translation is used to find target translations of its sentences. Word alignment is then used to find mappings between source and target words, so that source syntactic dependencies can be projected onto the target translations. Subsequently, a post-processing step removes unaligned target words to ensure that the resulting target syntax forms a valid dependency tree. The method has shown promising performance for unsupervised cross-lingual dependency parsing among transfer methods (McDonald et al., 2011; Täckström et al., 2012; Rasooli and Collins, 2015; Guo et al., 2016b).
The treebank translation method, however, suffers from various sources of noise. First, machine translation errors directly affect the resulting treebank by introducing ungrammatical word sequences. Second, the alignments between source and target words may not be isomorphic, due to inherent differences between languages or paraphrasing during translation. For example, in the case of Figure 1, the English words "are" and "being" and the Swedish word "med" do not have corresponding word-level translations. Likewise, it can be perfectly acceptable to translate "as soon as they can" as "very quickly", which loses word alignment information because of the longer span. Finally, errors in automatic word alignment also introduce noise. Such alignment errors can directly affect the grammaticality of the resulting target treebank through the deletion of unaligned words during post-processing, or cause lost or mistaken dependency arcs.
We consider a different approach for translation-based syntactic knowledge transfer, which aims to make the best use of source syntax while introducing minimal noise. To this end, we leverage recent advances in cross-lingual word representations, such as cross-lingual word clusters (Täckström et al., 2012) and cross-lingual word embeddings, which allow words from different languages to reside within a consistent feature vector space according to structural similarities between words. Thus, they offer a bridge on the lexical level between different languages (Ammar et al., 2016).
A cross-lingual model can be trained by directly using cross-lingual word representations on a source treebank. Using this method, knowledge transfer is achieved at the level of token correspondences. We take this approach as a naive baseline. To further introduce structural feature transfer, we transform a source treebank into a code-mixed treebank by considering word alignments between each source sentence and its machine translation. In particular, source words with highly confident target alignments are translated into target words by consulting the machine translation output, so that target word vectors can be directly used for learning target syntax. In addition, continuous spans of target words in the code-mixed treebank are reordered according to the translation, so that target grammaticality can be exploited to the maximum potential.
We conduct experiments on the Universal Dependency Treebanks (v2.0) (Nivre et al., 2016). The results show that a code-mixed treebank brings significantly better performance than a fully translated treebank, with an average improvement of 4.30 points in LAS. The code and related data will be made publicly available under the Apache License 2.0.¹

Related Work
Existing work on cross-lingual transfer can be classified into two categories. The first aims to train a dependency parsing model on source treebanks (McDonald et al., 2011; Guo et al., 2016a,b), or their adapted versions (Zhao et al., 2009; Tiedemann et al., 2014; Wang et al., 2017) in the target language. The second category, namely annotation projection, aims to produce a large set of training instances with automatic dependencies by parsing parallel sentences (Hwa et al., 2005; Rasooli and Collins, 2015). The two broad methods are orthogonal to each other, and both can make use of lexicalized dependency models trained with cross-lingual word representations (Rasooli and Collins, 2017, 2019).
¹ https://github.com/zhangmeishan/CodeMixedTreebank
Source Treebank Adaptation. There has been much work on unsupervised cross-lingual dependency parsing by direct source treebank transfer. Several researchers investigate delexicalized models, in which only non-lexical features are used (Zeman and Resnik, 2008; Cohen et al., 2011; McDonald et al., 2011; Naseem et al., 2012; Rosa and Zabokrtsky, 2015). All the features in these models are language independent and consistent across languages and treebanks. Thus they can be applied to target languages directly.
Subsequent research proposes to exploit lexicalized features to enhance the parsing models, by resorting to cross-lingual word representations (Täckström et al., 2012; Duong et al., 2015b,a; Zhang and Barzilay, 2015; Guo et al., 2016b; Ammar et al., 2016; Wick et al., 2016; de Lhoneux et al., 2018). Cross-lingual word clusters and cross-lingual word embeddings are the two main sources of features for transferring knowledge between source and target language sentences. These studies enable us to train lexicalized models on code-mixed treebanks as well. We therefore also integrate cross-lingual word representations, which gives more direct interaction between source and target words.
Our work follows another mainstream method in this line of work, namely treebank translation (Tiedemann et al., 2014; Tiedemann, 2015; Tiedemann and Agić, 2016), which aims to adapt an annotated source treebank into the target language by machine translation. In this method, the target-side sentences are produced by machine translation. Previous work aims to build a well-formed tree (Tiedemann and Agić, 2016) from source dependencies, resolving word alignment conflicts by heuristic rules. In contrast, we use partial translation to avoid unnecessary noise.
Annotation Projection. The annotation projection approach relies on a set of parallel sentences between the source and target languages (Hwa et al., 2005; Ganchev et al., 2009). In particular, a source parser trained on the source treebank is used to parse the source-side sentences of the parallel corpus. The source dependencies are then projected onto the target sentences according to word alignments. Different strategies can be applied for the dependency projection task (Ma and Xia, 2014; Rasooli and Collins, 2015; Xiao and Guo, 2015; Agić et al., 2016; Schlichtkrull and Søgaard, 2017). For example, one can project only dependency arcs whose words are aligned to target-side words with high confidence (Lacroix et al., 2016). The resulting treebank can be highly noisy due to the auto-parsed source dependency trees. Recently, Lacroix et al. (2016) and Rasooli and Collins (2017) propose to filter the results from the large-scale parallel corpus. Our work is different in that the source dependencies are from gold-standard treebanks.

Dependency Parsing
We adopt a state-of-the-art neural BiAffine parser (Dozat and Manning, 2016) as the baseline, which has achieved competitive performance for dependency parsing. The overall architecture is shown in Figure 2. Given an input sentence $w_1 \cdots w_n$, the model first finds the embedding representation of each word, $x_1 \cdots x_n$, where
$$x_i = e(w_i) \oplus e(c_i) \oplus e(t_i). \quad (1)$$
Here $c_i$ denotes the word cluster of $w_i$, and $t_i$ denotes its POS tag. We exploit cross-lingual word embeddings, cross-lingual word clusters and universal POS tags, respectively, which are consistent across languages and treebanks. A three-layer deep bidirectional long short-term memory (LSTM) network is applied on $x_1 \cdots x_n$ to obtain hidden vectors $h^{LSTM}_1 \cdots h^{LSTM}_n$. For head finding, two nonlinear feed-forward layers are applied on $h^{LSTM}_1 \cdots h^{LSTM}_n$ to obtain $h^{dep}_1 \cdots h^{dep}_n$ and $h^{head}_1 \cdots h^{head}_n$. We compute the score for each dependency $i \curvearrowleft j$ by:
$$s(i \curvearrowleft j) = \mathrm{BiAffine}(h^{dep}_i, h^{head}_j). \quad (2)$$
The above process is also used for scoring a labeled dependency $i \overset{l}{\curvearrowleft} j$, by extending the 1-dim vector $s$ into $L$ dims, where $L$ is the total number of dependency labels.
When all scores are ready, a softmax function is applied at each position over all candidate heads and labels, and the normalized scores are then used for decoding and training. For decoding, we exploit the MST algorithm to ensure tree-structured outputs. For training, we accumulate the cross-entropy loss at the word level by treating the normalized scores as prediction probabilities. The reader is referred to Dozat and Manning (2016) for more details.
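As a rough illustration of the scoring and training objective described in this section, the following Python sketch computes arc scores with a biaffine form and the word-level cross-entropy over candidate heads. The exact parameterization is an assumption: it follows the general form in Dozat and Manning (2016), a bilinear term plus a head-only bias term, and the names `W`, `b`, `biaffine_arc_scores` and `head_cross_entropy` are illustrative. Index 0 of `h_head` is assumed to be an artificial root.

```python
import math


def biaffine_arc_scores(h_dep, h_head, W, b):
    """scores[i][j] = h_dep[i]^T W h_head[j] + b^T h_head[j].

    h_dep: one vector per word (as dependent).
    h_head: one vector per candidate head (index 0 assumed to be the root).
    W: d_dep x d_head matrix; b: vector of length d_head (both assumed).
    """
    scores = []
    for dep in h_dep:
        # Project the dependent vector through W: v = dep^T W.
        v = [sum(dep[a] * W[a][c] for a in range(len(W)))
             for c in range(len(b))]
        scores.append([sum((v[c] + b[c]) * head[c] for c in range(len(b)))
                       for head in h_head])
    return scores


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(s - top) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def head_cross_entropy(scores, gold_heads):
    """Sum of word-level cross-entropy losses over candidate heads."""
    return -sum(math.log(softmax(row)[gold])
                for row, gold in zip(scores, gold_heads))
```

In a full parser the normalized scores would then be passed to an MST decoder; the sketch stops at the loss, since decoding is not specific to this work.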

Code-Mixed Treebank Translation
We derive code-mixed trees from source dependency trees by partial translation, projecting words and the corresponding dependencies that have high-confidence alignments with machine-translated target sentences. Our approach assumes that sentence-level translations and alignment probabilities are available. The motivation is to reduce noise induced by problematic word alignments. We adopt the word-level alignment strategy, which has been shown to be as effective as phrase-level alignment yet much simpler (Tiedemann et al., 2014; Tiedemann, 2015; Tiedemann and Agić, 2016). Given a source sentence $e_1 \cdots e_n$ and its target-language translation $f_1 \cdots f_m$, $p(e_i|f_j)$ denotes the probability of word $f_j$ being aligned with $e_i$ ($0 \le i \le n$ and $0 < j \le m$), where $e_0$ denotes a null word, so that $p(e_0|f_j)$ is the probability that the target word $f_j$ has no alignment.
The translation process is conducted in three steps: (1) word substitution, which incrementally substitutes source words with their target translations; (2) word deletion, which removes several unaligned source words; and (3) sentence reordering, which reorders the partially translated sentence to ensure local target-language word order.
Algorithm 1: The process of tree translation.
Algorithm 1 shows pseudocode for code-mixing tree translation, where lines 1-8 correspond to the first step, lines 9-16 to the second, and line 17 to the last step.
A hyper-parameter λ controls the overall ratio of translation. The function SELECT(S, λ) obtains a subset of S by ratio λ, taking the elements with the top values as indicated by v inside S. If λ = 0, the result is still the source-language dependency tree, since no source word is substituted or deleted. In this condition, our method reduces to source-side training, which bridges source and target dependency parsing with universal word representations. If λ = 1, the resulting tree is a fully translated target dependency tree, as all words are target-language words produced by translation. In this setting, our method is equivalent to Tiedemann and Agić (2016), the only difference being our baseline parsing model. Thus our method can be regarded as a generalization of both source-side training and fully translated target training (Tiedemann and Agić, 2016), with fine-grained control over translation confidence.
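The role of SELECT(S, λ) can be sketched in a few lines of Python. This is a minimal sketch under assumptions the text does not state: each element of S is represented as a `(value, element)` pair, the subset size is rounded down, and the hypothetical `largest` flag switches between picking the highest values, as in word substitution, and the lowest values, as in word deletion.

```python
def select(scored_items, ratio, largest=True):
    """Return the elements in the top `ratio` fraction of scored_items.

    scored_items: list of (value, element) pairs.
    ratio: the translation ratio lambda, between 0 and 1.
    largest: rank by highest values if True, by lowest values otherwise.
    """
    # Number of elements to keep, rounded down (an assumption). The small
    # epsilon guards against products landing just below an integer.
    k = int(len(scored_items) * ratio + 1e-9)
    ranked = sorted(scored_items, key=lambda pair: pair[0], reverse=largest)
    return [element for _, element in ranked[:k]]


# Toy alignment probabilities, made up for illustration.
aligned = [(0.9, "vi"), (0.2, "med"), (0.6, "hittat")]
print(select(aligned, 0.7))  # ['vi', 'hittat']
```

With `ratio=0` the function returns an empty list, and with `ratio=1` it returns every element, matching the two boundary cases of λ discussed above.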

Word Substitution
Word substitution is the key step for producing a target treebank. We first obtain the most confidently aligned source word $e_{a_j}$ for each target word $f_j$, as well as their alignment probability $p_j = p(e_{a_j}|f_j)$, as shown in line 3 of Algorithm 1. Then we sort the target words by these alignment probabilities, choosing the top $m\lambda$ words with the highest alignment probabilities for substitution. The sorting and choosing are reflected in line 5 of Algorithm 1. Finally, for each chosen word $f_j$ and its aligned source word $e_{a_j}$, we replace $e_{a_j}$ with $f_j$, as shown in line 7 of Algorithm 1.
One key to the substitution is maintaining the corresponding dependency structures. If $e_{a_j}$ and $f_j$ bear a one-to-one mapping, with no other target word aligned with $e_{a_j}$, the source dependencies are kept unchanged, as shown in Figure 3(a). If, however, two or more words $f_{j_1}, \cdots, f_{j_k}$ ($j_1 < \cdots < j_k$) are aligned to $e_{a_j}$, we simply link all words to $e_{a_j}$, with the same dependency label as the original dependency arc. Figure 3(b) illustrates this condition. Both the Swedish words "under" and "tiden" are headed to "hittat" (the Swedish translation of the English word "found") with the dependency label "advmod" inherited from the source English side. Note that the POS tags of the substituted words are the same as those of the corresponding source words.
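The substitution step can be sketched as follows. This is an illustrative reading rather than the released implementation, and it makes several assumptions: `align[j][i]` stores $p(e_i|f_j)$ with `i = 0` as the null word, target words whose best alignment is the null word are skipped, $m\lambda$ is rounded down, and every target word that replaces a source word inherits that word's head, label and POS tag (as the description of Figure 3(b) suggests). The function name `substitute_words` and the toy data are hypothetical.

```python
def substitute_words(src_words, tgt_words, align, ratio):
    """Partially translate a source sentence by word substitution.

    src_words: source words e_1..e_n.
    tgt_words: target words f_1..f_m.
    align[j][i]: p(e_i | f_j) for target word j (0-based) and source
        position i, where i = 0 is the null word.
    Returns a list of (form, source_position) pairs; each token keeps the
    head, label and POS tag of the source word at source_position.
    """
    candidates = []
    for j, probs in enumerate(align):
        best_i = max(range(len(probs)), key=lambda i: probs[i])
        if best_i > 0:  # skip target words whose best alignment is null
            candidates.append((probs[best_i], j, best_i))

    # Keep the top m * ratio target words by alignment probability.
    k = int(len(tgt_words) * ratio + 1e-9)
    candidates.sort(key=lambda c: -c[0])
    replacements = {}
    for _, j, i in candidates[:k]:
        replacements.setdefault(i, []).append(j)

    mixed = []
    for i, word in enumerate(src_words, start=1):
        if i in replacements:
            # One or more target words replace e_i and inherit its arc.
            for j in sorted(replacements[i]):
                mixed.append((tgt_words[j], i))
        else:
            mixed.append((word, i))
    return mixed


# Made-up example: two target words align to the first source word.
src = ["meanwhile", "found"]
tgt = ["under", "tiden", "hittat"]
probs = [[0.1, 0.8, 0.1], [0.1, 0.7, 0.2], [0.0, 0.1, 0.9]]
print(substitute_words(src, tgt, probs, 1.0))
```

Returning the source position with each token keeps the sketch independent of any particular tree data structure: the caller can look up the head, label and POS tag of that position in the source tree.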

Word Deletion
There can be source words to which no target word is aligned. These are typically function words that belong to the source language only, such as "the", "are" and "have". We remove such words to produce dependency trees that are close in syntax to the target language.
In particular, we accumulate the probabilities $p(e_i|f_j)$ for each source word $e_i$ that has no aligned target word:
$$r_i = \sum_{j=1}^{m} p(e_i|f_j),$$
where we traverse all target words to sum their alignment probabilities with $e_i$. The value of $r_i$ can be interpreted as a confidence score for retention. Words with lower retention scores are preferred for deletion, as they have lower probabilities of aligning with any word of the target-language sentence. Concretely, we collect all source words with no aligned target words, compute their retention values, and then select a subset of these words with the lowest retention values according to the hyper-parameter λ (line 13 in Algorithm 1). Finally, we delete all the selected words (line 15 in Algorithm 1). Figure 4 shows an example of word deletion. The two words "are" and "being" both have no aligned words on the other side, and "are" has a lower retention score than "being".² Thus the source word "are" is preferred for deletion. In most cases, the deleted words are leaf nodes, which can be detached from the resulting dependency tree and deleted directly. In exceptional cases, we simply reset the heads of the child nodes to the head of $e_i$ (i.e., $h_{e_i}$). For example, a dependency $w_i \curvearrowleft e_i$ is changed into $w_i \curvearrowleft h_{e_i}$.
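The deletion step, including the re-attachment of children when a deleted word is not a leaf, can be sketched as below. The sketch assumes 1-based word positions with head 0 for the root, applies the ratio to the set of unaligned source words with rounding down, and follows the head chain upward when the head of a deleted word is itself deleted (a case the text does not discuss). All function names are hypothetical.

```python
def retention_scores(align, unaligned):
    """r_i = sum over target words j of p(e_i | f_j), for unaligned e_i.

    align[j][i]: p(e_i | f_j); i = 0 is the null word.
    unaligned: 1-based source positions with no aligned target word.
    """
    return {i: sum(row[i] for row in align) for i in unaligned}


def select_for_deletion(scores, ratio):
    """Pick the `ratio` fraction of words with the lowest retention."""
    k = int(len(scores) * ratio + 1e-9)
    return set(sorted(scores, key=lambda i: scores[i])[:k])


def delete_words(words, heads, to_delete):
    """Remove words and re-attach orphaned children.

    heads[i - 1] is the 1-based head of word i, with 0 for the root.
    Returns the remaining words and their re-indexed heads.
    """
    def surviving_head(h):
        # Climb past deleted ancestors until a kept word or the root.
        while h != 0 and h in to_delete:
            h = heads[h - 1]
        return h

    kept = [i for i in range(1, len(words) + 1) if i not in to_delete]
    new_index = {old: new for new, old in enumerate(kept, start=1)}
    new_index[0] = 0
    new_words = [words[i - 1] for i in kept]
    new_heads = [new_index[surviving_head(heads[i - 1])] for i in kept]
    return new_words, new_heads
```

Note that deleting the root word itself would leave several words attached to the root in this sketch; a real implementation would need an explicit policy for that case.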

Sentence Reordering
Continuous target spans are reordered so that the final code-mixed sentence contains grammatical phrases in the target language. Figure 5(a) shows one example of full-sentence reordering. We can see that the word order produced by word-level substitutions on the source words is different from the order of the machine-translated sentence. Thus we adjust the leaf nodes, letting the word order strictly follow the machine-translated sentence order. For example, the word "vi" is moved from the first position to the third position, and similarly the word "mu" is moved from the last position to the first position. Concretely, we perform word reordering at the span level, extracting all continuous spans of target words, because the target-language words may be interrupted by source-language words. Then we reorder the words in each span according to their order in the machine translation output. Figure 5(b) shows another example, where there are two spans separated by the English word "the". Each span is reordered individually. We do not consider inconsistent orders across spans in this work. Note that this step does not change any dependency arc between words.
² Both words are related only to the Swedish word "med", but p(are|med) is slightly lower.
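Span-level reordering amounts to sorting each maximal run of target words by position in the machine translation output while leaving source words where they are. A minimal sketch, assuming each token is a `(form, tgt_pos)` pair in which `tgt_pos` is the word's index in the translation, or `None` for an untranslated source word (this representation and the name `reorder_spans` are hypothetical):

```python
def reorder_spans(tokens):
    """Reorder each continuous span of target words by translation order.

    tokens: list of (form, tgt_pos); tgt_pos is None for source words.
    Source words act as span boundaries and never move.
    """
    reordered, span = [], []
    for token in tokens:
        if token[1] is None:
            # A source word closes the current target span.
            reordered.extend(sorted(span, key=lambda t: t[1]))
            span = []
            reordered.append(token)
        else:
            span.append(token)
    reordered.extend(sorted(span, key=lambda t: t[1]))
    return reordered


# Two target spans separated by the source word "the" (toy data).
mixed = [("b", 2), ("a", 1), ("the", None), ("d", 4), ("c", 3)]
print(reorder_spans(mixed))
# [('a', 1), ('b', 2), ('the', None), ('c', 3), ('d', 4)]
```

Because only the linear order of tokens changes, dependency arcs stay intact as long as heads are stored by token identity rather than by surface position.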

Experiments
We conduct experiments to verify the effectiveness of our proposed models in this section.

Settings
Our experiments are conducted on the Google Universal Dependency Treebanks (v2.0) (Nivre et al., 2016), using English as the source language and six target languages: Spanish (ES), German (DE), French (FR), Italian (IT), Portuguese (PT) and Swedish (SV). Google Translate³ is used to translate the sentences in the English training set into the other languages. To generate high-quality word-level alignments, we merge the translated sentence pairs with the parallel data of EuroParl (Koehn, 2005) and run the fastAlign tool (Dyer et al., 2013) on the merged data to obtain word alignments.
We use the cross-lingual word embeddings and clusters of Guo et al. (2016b) for the baseline system. The dimension size of the word embeddings is 50, and the number of word clusters across all languages is 256.
For network building and training, we use the same settings as Dozat and Manning (2016), including the dimension sizes, the dropout ratio and the parameter optimization method. We assume that no labeled corpus is available for the target language. Thus training is performed for 50 iterations over the whole training data without early stopping.
To evaluate dependency parsing performance, we adopt UAS and LAS as the major metrics, which indicate the accuracies of unlabeled and labeled dependencies, respectively. We ignore punctuation during evaluation, following previous work. We run each experiment 10 times and report the averaged results.
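For concreteness, the two metrics can be computed as in the sketch below. It assumes that punctuation is identified by the universal POS tag `PUNCT` on the gold side, which is one common convention but is not specified in the text; the function name and data layout are hypothetical.

```python
def attachment_scores(gold, pred, punct_tag="PUNCT"):
    """Return (UAS, LAS) in percent, skipping punctuation tokens.

    gold: list of (head, label, upos) per word.
    pred: list of (head, label) per word, aligned with gold.
    """
    total = uas_hits = las_hits = 0
    for (g_head, g_label, upos), (p_head, p_label) in zip(gold, pred):
        if upos == punct_tag:
            continue  # punctuation is ignored during evaluation
        total += 1
        if g_head == p_head:
            uas_hits += 1
            # LAS additionally requires the correct dependency label.
            if g_label == p_label:
                las_hits += 1
    if total == 0:
        return 0.0, 0.0
    return 100.0 * uas_hits / total, 100.0 * las_hits / total
```

In the setting described above, such scores would be averaged over the 10 runs of each experiment.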

Models
We compare performance across the following models:
• Delex: The delexicalized BiAffine model without cross-lingual word embeddings and clusters.
• Src: The BiAffine model trained on the source English treebank only.
• PartProj (Lacroix et al., 2016): The BiAffine model trained on the corpus obtained by projecting only the source dependencies involving high-confidence alignments onto target sentences. Note that this baseline only draws on the idea of Lacroix et al. (2016); the two models are in fact significantly different.
• Tgt (Tiedemann and Agić, 2016): The BiAffine model trained on the fully translated target treebank only.
• Src+Tgt: The BiAffine model trained on the combination of the source and fully translated target treebanks.
• Mix: The BiAffine model trained on the code-mixed treebank only.
• Src+Mix: The BiAffine model trained on the combination of the source and code-mixed treebanks.
Figure 6: The LAS relative to the translation ratio λ.
The Src and Tgt methods have been discussed in Section 4. The PartProj model is another way to leverage imperfect word alignments (Lacroix et al., 2016). The training corpus of PartProj may contain incomplete dependency trees with a number of words missing heads, because no word is deleted from the machine translation outputs. The POS tags of words in PartProj with low-confidence alignments are obtained by a supervised POS tagger (Yang et al., 2018) trained on the corresponding universal treebank.

Development Results
We conduct several development experiments on the Swedish dataset to examine important factors of our model.

Influence of The Translation Ratio λ
Our model has an important hyper-parameter λ that controls the percentage of translation. Figure 6 shows the influence of this factor, where the ratio increases from 0 to 1 at fixed intervals. A λ of 0 gives our baseline, which uses the source treebank only. As λ grows, more source words are translated into the target. We can see that the performance improves after translating some source dependencies into the target, demonstrating the effectiveness of syntactic transfer. The performance peaks at λ = 0.7, but there is a significant drop when λ grows from 0.8 to 0.9. This may be because the newly added dependency arc projections are mostly noisy. This sharp decrease indicates that noise from low-confidence word alignments can have a strong impact on performance. According to the results, we adopt λ = 0.7 for code-mixed treebanking.

Mixing with Source TreeBank
We investigate the effectiveness of the source treebank by merging it into the translated treebanks. First, we show the performance of Src, Tgt and Mix, each trained on its individual treebank. Then we merge the source treebank with each of the two translated treebanks, and show the results of training on the merged corpora. Table 1 shows the results. We find that the source treebank is complementary to the translated treebanks. Notably, although Src + Mix gives the best performance, its improvement over Mix is smaller than that of Src + Tgt over Tgt. This is reasonable, as the code-mixed treebank contains more source treebank content than the fully translated target treebank.

Ablation Studies
The overall translation is conducted in three steps, as mentioned in Section 4, where the first step, word substitution, is compulsory, and the remaining two steps aim to build better mixed dependency trees.
Here we conduct ablation studies to test the effectiveness of word deletion and sentence reordering. Table 2 shows the experimental results. We can see that both steps are important for dependency tree translation. Without word deletion and sentence reordering, the Mix model shows decreases of 0.82 and 0.65 in LAS, respectively. If both are removed, the performance is only comparable with the baseline Src model (see Table 1).

Final Results
We show the final results of our proposed models in Table 3. As shown, the Tgt model gives better average performance than Src. However, its results on French and Italian are slightly worse, which indicates that noise from translation affects the quality of the projected treebank. For the PartProj model, we conduct preliminary experiments on Swedish to tune the ratio of the projected dependencies. The results show that the difference is very small (δ = 0.24 for UAS) between 0.9 and 1.0, and the performance degrades significantly as the ratio decreases below 0.9. This observation indicates that the method is probably not effective for filtering low-confidence word alignments. The final results confirm our hypothesis. As shown in Table 3, the PartProj model gives performance only comparable with Src. One possible reason is the unremoved target words (if these words are removed, the PartProj model with ratio 1.0 is identical to Tgt), which have previously been shown to be noisy (Tiedemann and Agić, 2016).

Comparison with Previous Work
We compare our method with previous work in the literature. The first block of compared systems includes Guo15, Guo16 (Guo et al., 2016b) and TA16 (Tiedemann and Agić, 2016). Our model gives the best performance, with one exception on German. One possible reason is that TA16 exploits multiple source treebanks besides English. The second block shows representative annotation projection models, including MX14 (Ma and Xia, 2014), RC15 (Rasooli and Collins, 2015) and LA16. Annotation projection models can be complementary to our work, since they build target training corpora from raw parallel texts. The best-performing results, from the RC17 model (Rasooli and Collins, 2017), demonstrate this point, as RC17 can be regarded as a combination of dictionary-based treebank translation⁴ (Zhao et al., 2009) and RC15.

Analysis
We conduct experimental analysis on the Spanish (ES) dataset to show the differences between the Src, Tgt and Mix models.
⁴ The method has been shown to be worse than TA16 in Tiedemann and Agić (2016).

Performance Relative to POS Tags
Figure 7 shows the F-scores of labeled dependencies on different POS tags. We list the six most representative POS tags. The Mix model achieves the best F-scores on 5 of the 6 POS tags, the only exception being ADV, on which it shows no significant difference from the Tgt model. The Mix and Tgt models are much better than the Src model as a whole, especially on the POS tag ADJ, where an increase of over 20% is achieved. In addition, we find that the Src model can significantly outperform the Tgt model on ADP and DET. For Spanish, ADP words are typically "de", "en", "con" and so on, which behave similarly to English words such as "'s", "to" and "of". The Spanish DET words include "el", "la", "su" and so on, which are similar to English words such as "the" and "a". These words are highly ambiguous for automatic word alignment. The results indicate that our Mix model can better handle this word alignment noise, mitigating its negative influence on treebank translation, while the Tgt model suffers from such noise.
Figure 8 shows the F-scores of labeled dependencies by arc distance. We treat the root type as a special case. According to the results, the Mix model performs the best over all distances, indicating its effectiveness for treebank transfer. The Tgt model achieves better performance than the Src model, with one exception at distance 2. We look further into the dependency patterns of distance-2 arcs, finding that the dependency arc ADP $\curvearrowleft$ * accounts for over 30% of them and is the major source of errors. This finding is consistent with that on POS tags, showing the effectiveness of the code-mixed treebank in handling noise. In addition, the performance drops gradually as the distance increases. The F-score of the root dependency is the highest.

Conclusion
We proposed a new treebank translation method for unsupervised cross-lingual dependency parsing. Unlike previous work, which adopts full-scale translation of source dependency trees, we investigated partial translation, producing synthetic code-mixed treebanks. The method can better leverage imperfect word alignments between source and target sentence pairs, translating only high-confidence source words and thus generating dependencies of high quality. Experimental results on the Universal Dependency Treebanks v2.0 showed that partial translation is highly effective, and that code-mixed treebanks can give significantly better results than full-scale translation. Our method is complementary to several other methods for cross-lingual transfer, such as annotation projection, and can thus be further integrated with these methods.