Density-Driven Cross-Lingual Transfer of Dependency Parsers

We present a novel method for the cross-lingual transfer of dependency parsers. Our goal is to induce a dependency parser in a target language of interest without any direct supervision: instead we assume access to parallel translations between the target and one or more source languages, and to supervised parsers in the source language(s). Our key contributions are to show the utility of dense projected structures when training the target language parser, and to introduce a novel learning algorithm that makes use of dense structures. Results on several languages show an absolute improvement of 5.51% in average dependency accuracy over the state-of-the-art method of (Ma and Xia, 2014). Our average dependency accuracy of 82.18% compares favourably to the accuracy of fully supervised methods.


Introduction
In recent years there has been a great deal of interest in dependency parsing models for natural languages. Supervised learning methods have been shown to produce highly accurate dependency parsing models; unfortunately, these methods rely on human-annotated data, which is expensive to obtain, leading to a significant barrier to the development of dependency parsers for new languages. Recent work has considered unsupervised methods (e.g. (Klein and Manning, 2004; Headden III et al., 2009; Gillenwater et al., 2011; Mareček and Straka, 2013; Spitkovsky et al., 2013; Le and Zuidema, 2015; Grave and Elhadad, 2015)), or methods that transfer linguistic structures across languages (e.g. (Cohen et al., 2011; McDonald et al., 2011; Ma and Xia, 2014; Tiedemann, 2015; Zhang and Barzilay, 2015; Xiao and Guo, 2015)), in an effort to reduce or eliminate the need for annotated training examples. Unfortunately the accuracy of these methods generally lags quite substantially behind the performance of fully supervised approaches.
This paper describes novel methods for the transfer of syntactic information between languages. As in previous work (Hwa et al., 2005; Ganchev et al., 2009; McDonald et al., 2011; Ma and Xia, 2014), our goal is to induce a dependency parser in a target language of interest without any direct supervision (i.e., a treebank) in the target language: instead we assume access to parallel translations between the target and one or more source languages, and to supervised parsers in the source languages. We can then use alignments induced using tools such as GIZA++ (Och and Ney, 2000) to transfer dependencies from the source language(s) to the target language (example projections are shown in Figure 1). A target language parser is then trained on the projected dependencies.
* Currently on leave at Google Inc. New York.
Figure 1: An example projection from English to German in the Europarl data (Koehn, 2005). The English parse tree is the output from a supervised parser, while the German parse tree is projected from the English parse tree using translation alignments from GIZA++.
Our contributions are as follows:
• We demonstrate the utility of dense projected structures when training the target-language parser. In the most extreme case, a "dense" structure is a sentence in the target language where the projected dependencies form a fully projective tree that includes all words in the sentence (we will refer to these structures as "full" trees). In more relaxed definitions, we might include sentences where at least some proportion (e.g., 80%) of the words participate as a modifier in some dependency, or where long sequences (e.g., 7 words or more) of words all participate as modifiers in some dependency. We give empirical evidence that dense structures give particularly high accuracy for their projected dependencies.
• We describe a training algorithm that builds on the definitions of dense structures. The algorithm initially trains the model on full trees, then iteratively introduces increasingly relaxed definitions of density. The algorithm makes use of a training method that can leverage partial (incomplete) dependency structures, and also makes use of confidence scores from a perceptron-trained model.
In spite of the simplicity of our approach, our experiments demonstrate significant improvements in accuracy over previous work. In experiments on transfer from a single source language (English) to a single target language (German, French, Spanish, Italian, Portuguese, and Swedish), our average dependency accuracy is 78.89%. When using multiple source languages, average accuracy is improved to 82.18%. This is a 5.51% absolute improvement over the previous best results reported on this data set, 76.67% for the approach of (Ma and Xia, 2014). To give another perspective, our accuracy is close to that of the fully supervised approach of (McDonald et al., 2005), which gives 84.29% accuracy on this data. To the best of our knowledge these are the highest accuracy parsing results for an approach that makes no use of treebank data for the language of interest.

Related Work
A number of researchers have considered the problem of projecting linguistic annotations from the source to the target language in a parallel corpus (Yarowsky et al., 2001; Hwa et al., 2005; Ganchev et al., 2009; Spreyer and Kuhn, 2009; McDonald et al., 2011; Ma and Xia, 2014). The projected annotations are then used to train a model in the target language. This prior work involves various innovations such as the use of posterior regularization (Ganchev et al., 2009), the use of entropy regularization and parallel guidance (Ma and Xia, 2014), the use of a simple method to transfer delexicalized parsers across languages (McDonald et al., 2011), and a method for training on partial annotations that are projected from source to target language (Spreyer and Kuhn, 2009). There is also recent work on treebank translation via a machine translation system (Tiedemann et al., 2014; Tiedemann, 2015). The work of (McDonald et al., 2011) and (Ma and Xia, 2014) is most relevant to our own work, for two reasons: first, these papers consider dependency parsing, and as in our work use the latest version of the Google universal treebank for evaluation; second, these papers represent the state of the art in accuracy. The results in (Ma and Xia, 2014) dominate the accuracies for all other papers discussed in this related work section: they report an average accuracy of 76.67% on the languages German, Italian, Spanish, French, Swedish and Portuguese; this evaluation includes all sentence lengths.
Other work on unsupervised parsing has considered various methods that transfer information from source to target languages, where parsers are available in the source languages, but without the use of parallel corpora (Cohen et al., 2011; Durrett et al., 2012; Naseem et al., 2012; Duong et al., 2015; Zhang and Barzilay, 2015). These results are somewhat below the performance of (Ma and Xia, 2014).

Our Approach
This section describes our approach, giving definitions of parallel data and of dense projected structures; describing preliminary exploratory experiments on transfer from English to German; describing the iterative training algorithm used in our work; and finally describing a generalization of the method to transfer from multiple languages.

Parallel Data Definitions
We assume that we have parallel data in two languages. The source language, for which we have a supervised parser, is assumed to be English. The target language, for which our goal is to learn a parser, will be referred to as the "foreign" language. We describe the generalization to more than two languages in §3.5. We use the following notation. Our parallel data is a set of examples $(e^{(k)}, f^{(k)})$ for $k = 1 \ldots n$, where each $e^{(k)}$ is an English sentence, and each $f^{(k)}$ is a foreign sentence. Each $e^{(k)} = e^{(k)}_1 \ldots e^{(k)}_{s_k}$, where each $e^{(k)}_i$ is a word, and $s_k$ is the length of the k'th source sentence. Similarly, $f^{(k)} = f^{(k)}_1 \ldots f^{(k)}_{t_k}$, where each $f^{(k)}_j$ is a word, and $t_k$ is the length of the k'th foreign sentence.
A dependency is a four-tuple $(l, k, h, m)$ where $l \in \{e, f\}$ is the language, $k$ is the sentence number, $h$ is the head index, and $m$ is the modifier index. Note that if $l = e$ then we have $0 \le h \le s_k$ and $1 \le m \le s_k$; conversely, if $l = f$ then $0 \le h \le t_k$ and $1 \le m \le t_k$. We use $h = 0$ when $h$ is the root of the sentence. For $k = 1 \ldots n$ and $j = 0 \ldots t_k$, the alignment variable $A_{k,j}$ is the index of the word in $e^{(k)}$ that $f^{(k)}_j$ is aligned to; it is NULL if $f^{(k)}_j$ is not aligned to anything. We have $A_{k,0} = 0$ for all $k$: that is, the root in one language is always aligned to the root in the other language.
In our experiments we use intersected alignments from GIZA++ (Och and Ney, 2000) to provide the A k,j values.
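To make the role of the alignment variables concrete, the following Python sketch builds the values A k,j for one sentence pair by intersecting two directional alignments. The function name, the link-set input format, and the use of `None` for NULL are assumptions made for this illustration; they do not reflect GIZA++'s own output format or interface.

```python
from typing import Dict, List, Optional, Set, Tuple

Link = Tuple[int, int]  # (source index i, target index j), both 1-based


def intersect_alignments(
    src_to_tgt: Set[Link],
    tgt_to_src: Set[Link],
    target_len: int,
) -> List[Optional[int]]:
    """Return a list A where A[j] is the source index aligned to target word j.

    A[0] = 0 encodes the root-to-root alignment; None plays the role of NULL.
    Both inputs are assumed to be stored as (source, target) pairs.
    """
    both = src_to_tgt & tgt_to_src  # keep only links found in both directions
    alignment: List[Optional[int]] = [None] * (target_len + 1)
    alignment[0] = 0

    sources_for_target: Dict[int, List[int]] = {}
    for i, j in both:
        sources_for_target.setdefault(j, []).append(i)

    for j, sources in sources_for_target.items():
        # Assumption: a target word with several surviving links is left
        # unaligned rather than resolved arbitrarily.
        if 1 <= j <= target_len and len(sources) == 1:
            alignment[j] = sources[0]
    return alignment


if __name__ == "__main__":
    forward = {(1, 1), (2, 2), (3, 2)}
    backward = {(1, 1), (2, 2), (4, 3)}
    print(intersect_alignments(forward, backward, target_len=3))
```

Intersection keeps only links that both directional models agree on, which trades coverage for precision; unaligned target words simply receive no projected dependency.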

Projected Dependencies
We now describe various sets of projected dependencies. We use $D$ to denote the set of all dependencies in the source language: these dependencies are the result of parsing the English side of the translation data using a supervised parser. Each dependency $(l, k, h, m) \in D$ is a four-tuple as described above, with $l = e$. We will use $P$ to denote the set of all projected dependencies from the source to the target language. The set $P$ is constructed from $D$ and the alignment variables $A_{k,j}$ as follows:
$$P = \{(l, k, h, m) : l = f,\ A_{k,h} \neq \text{NULL},\ A_{k,m} \neq \text{NULL},\ (e, k, A_{k,h}, A_{k,m}) \in D\}$$
We say the k'th sentence receives a full parse under the dependencies $P$ if the dependencies $(f, k, h, m)$ for $k$ form a projective tree over the entire sentence: that is, each word has exactly one head, the root symbol is the head of the entire structure, and the resulting structure is a projective tree. We use $T_{100} \subseteq \{1 \ldots n\}$ to denote the set of all sentences that receive a full parse under $P$. We then define the following set,
$$P_{100} = \{(l, k, h, m) \in P : k \in T_{100}\}$$
We say the k'th sentence receives a dense parse under the dependencies $P$ if the dependencies of the form $(f, k, h, m)$ for $k$ form a projective tree over at least 80% of the words in the sentence. We use $T_{80} \subseteq \{1 \ldots n\}$ to denote the set of all sentences that receive a dense parse under $P$. We then define the following set,
$$P_{80} = \{(l, k, h, m) \in P : k \in T_{80}\}$$
We say the k'th sentence receives a span-$s$ parse, where $s$ is an integer, if there is a sequence of at least $s$ consecutive words in the target language that are all seen as a modifier in the set $P$. We use $S_s$ to refer to the set of all sentences with a span-$s$ parse. We define the sets
$$P_{\geq 7} = \{(l, k, h, m) \in P : k \in S_7\}, \quad P_{\geq 5} = \{(l, k, h, m) \in P : k \in S_5\}, \quad P_{\geq 1} = \{(l, k, h, m) \in P : k \in S_1\}$$
Finally, we also create datasets that only include projected dependencies that are consistent with respect to part-of-speech (POS) tags for the head and modifier words in source and target data. We assume a function $\text{POS}(k, j, i)$ which returns TRUE if the POS tags for words $f^{(k)}_j$ and $e^{(k)}_i$ are consistent. The definition of POS-consistent projected dependencies is then as follows:
$$\bar{P} = \{(l, k, h, m) : l = f,\ A_{k,h} \neq \text{NULL},\ A_{k,m} \neq \text{NULL},\ (e, k, A_{k,h}, A_{k,m}) \in D,\ \text{POS}(k, h, A_{k,h}) = \text{TRUE},\ \text{POS}(k, m, A_{k,m}) = \text{TRUE}\}$$
We experiment with two definitions for the POS function. The first imposes a hard constraint, that the POS tags in the two languages must be identical. The second imposes a soft constraint, that the two POS tags must fall into the same equivalence class: the equivalence classes used are listed in §4.1. Given this definition of $\bar{P}$, we can create sets $\bar{P}_{100}$, $\bar{P}_{80}$, $\bar{P}_{\geq 7}$, $\bar{P}_{\geq 5}$, and $\bar{P}_{\geq 1}$, using analogous definitions to those given above.
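The projection step and the density tests described in this section can be sketched for a single sentence pair as follows. This is an illustrative Python sketch, not the implementation used in the experiments: all function names are hypothetical, POS filtering is omitted, and the 80% test is simplified to the fraction of words that appear as a modifier rather than a check for a projective tree over 80% of the words.

```python
from typing import Dict, List, Optional, Set, Tuple

Arc = Tuple[int, int]  # (head, modifier); head 0 is the root symbol


def project(source_arcs: Set[Arc], alignment: List[Optional[int]]) -> Set[Arc]:
    """Project source arcs to the target side.

    alignment[j] is the source index aligned to target position j
    (alignment[0] == 0 for the root, None for NULL).
    """
    t = len(alignment) - 1
    projected = set()
    for m in range(1, t + 1):
        for h in range(0, t + 1):
            if h == m or alignment[h] is None or alignment[m] is None:
                continue
            if (alignment[h], alignment[m]) in source_arcs:
                projected.add((h, m))
    return projected


def is_full_parse(arcs: Set[Arc], length: int) -> bool:
    """True if the arcs form a projective tree over all `length` words."""
    heads: Dict[int, int] = {}
    for h, m in arcs:
        if m in heads:  # a word with two heads
            return False
        heads[m] = h
    if set(heads) != set(range(1, length + 1)):
        return False
    for m in heads:  # every word must reach the root without a cycle
        seen, node = set(), m
        while node != 0:
            if node in seen or node not in heads:
                return False
            seen.add(node)
            node = heads[node]
    spans = [(min(h, m), max(h, m)) for h, m in arcs]
    # Projectivity: no two arcs may cross.
    return not any(a < c < b < d for a, b in spans for c, d in spans)


def modifier_coverage(arcs: Set[Arc], length: int) -> float:
    """Fraction of words that appear as a modifier (simplified 80% test)."""
    return len({m for _, m in arcs}) / length if length else 0.0


def longest_modifier_span(arcs: Set[Arc], length: int) -> int:
    """Length of the longest run of consecutive words that are modifiers."""
    modifiers = {m for _, m in arcs}
    best = run = 0
    for j in range(1, length + 1):
        run = run + 1 if j in modifiers else 0
        best = max(best, run)
    return best


def density_labels(arcs: Set[Arc], length: int) -> Set[str]:
    """Names of the density-based sets this sentence would contribute to."""
    labels = set()
    if is_full_parse(arcs, length):
        labels.add("P100")
    if modifier_coverage(arcs, length) >= 0.8:
        labels.add("P80")
    for s in (7, 5, 1):
        if longest_modifier_span(arcs, length) >= s:
            labels.add(f"P>={s}")
    return labels


if __name__ == "__main__":
    english = {(0, 2), (2, 1), (2, 3)}
    print(density_labels(project(english, [0, 1, 2, 3]), 3))
    print(density_labels(project(english, [0, None, 2, 3]), 3))
```

In the first call every target word is aligned and the projected arcs form a full tree; in the second, one unaligned word leaves a partial structure that only qualifies for the most relaxed span-based set.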

Preliminary Experiments with Transfer from English to German
Throughout the experiments in this paper, we used German as the target language for development of our approach. Table 1 shows some preliminary results on transferring dependencies from English to German. We can estimate the accuracy of dependency subsets such as P 100 , P 80 , P ≥7 and so on by comparing these dependencies to the dependencies from a supervised German parser on the same data. That is, we use a supervised parser to provide gold standard annotations. The full set of dependencies P gives 74.0% accuracy under this measure; results for P 100 are considerably higher in accuracy, ranging from 83.0% to 90.1% depending on how POS constraints are used. As a second evaluation method, we can test the accuracy of a model trained on the P 100 data. The benefit of the soft-matching POS definition is clear. The hard match definition harms performance, presumably because it reduces the number of sentences used to train the model.
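The first evaluation measure, agreement between projected dependencies and the output of a supervised parser, can be written down in a few lines. The Python sketch below assumes a hypothetical data layout (a dictionary from sentence number to projected arcs, and a dictionary from sentence number to the reference head of each word); it is meant only to pin down what is being counted.

```python
from typing import Dict, Set, Tuple

Arc = Tuple[int, int]  # (head, modifier); head 0 is the root symbol


def projection_accuracy(
    projected: Dict[int, Set[Arc]],
    reference_heads: Dict[int, Dict[int, int]],
) -> float:
    """Fraction of projected arcs whose head matches the reference parse.

    Only words that received a projected head are scored, so sparse
    projections are judged on the dependencies they actually contain.
    """
    total = 0
    correct = 0
    for k, arcs in projected.items():
        heads = reference_heads.get(k, {})
        for h, m in arcs:
            total += 1
            if heads.get(m) == h:
                correct += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    projected = {1: {(0, 2), (2, 1)}, 2: {(1, 2)}}
    reference = {1: {1: 2, 2: 0, 3: 2}, 2: {1: 0, 2: 3}}
    print(f"{100 * projection_accuracy(projected, reference):.1f}%")
```

Because the reference comes from a parser rather than from human annotation, a number computed this way is an estimate of projection quality, not an exact accuracy.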
Throughout the rest of this paper, we use the soft POS constraints in all projection algorithms. 3
3 The hard constraint is also used by Ma and Xia (2014).

The Training Procedure
We now describe the training procedure used in our experiments. We use a perceptron-trained shift-reduce parser, similar to that of (Zhang and Nivre, 2011). We assume that the parser is able to operate in a "constrained" mode, where it returns the highest scoring parse that is consistent with a given subset of dependencies. This can be achieved via zero-cost dynamic oracles.
[Figure 2: The learning algorithm. Definitions: functions TRAIN, CDECODE, TOP as defined in §3.4. Output: parameter vectors θ 1 , θ 2 , θ 3 , θ 4 .]
We assume the following definitions:
• TRAIN(D) is a function that takes a set of dependency structures D as input, and returns a model θ as its output. The dependency structures are assumed to be full trees: that is, they correspond to fully projected trees with the root symbol as their root.
• CDECODE(P, θ) is a function that takes a set of partial dependency structures P, and a model θ as input, and as output returns a set of full trees D. It achieves this by constrained decoding of the sentences in P under the model θ, where for each sentence we use beam search to search for the highest scoring projective full tree that is consistent with the dependencies in P.
• TOP(D, θ) takes as input a set of full trees D, and a model θ. It returns the trees in D that the model scores most confidently; in our experiments these are the 200,000 trees that the perceptron is most confident on.
Table 1: Preliminary results on transferring dependencies from English to German. See §3.2 for definitions of P, P 100 etc. Columns labeled "Acc." show accuracy when the output of a supervised German parser is used as gold standard data. Columns labeled "#sen" show number of sentences. "dense" shows P 100 ∪ P 80 ∪ P ≥7 and "Train" shows accuracy on test data of a model trained on the P 100 trees.
Figure 2 shows the learning algorithm. It generates a sequence of parsing models, θ 1 . . . θ 4 . In the first stage of learning, the model is initialized by training on P 100 . The method then uses this model to fill in the missing dependencies on P 80 ∪ P ≥7 using the CDECODE method; this data is added to P 100 and the model is retrained. The method is iterated, at each point adding in additional partial structures (note that P ≥7 ⊆ P ≥5 ⊆ P ≥1 , hence at each stage we expand the set of training data that is parsed using CDECODE).
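The control flow of the iterative procedure described above can be sketched in Python as follows. The individual steps of Figure 2 are not reproduced in this text, so the loop below is a reading of the prose only: `train`, `cdecode` and `top` stand in for TRAIN, CDECODE and TOP and are passed in as callables, the list of relaxed stages is supplied by the caller, and the exact composition of each stage's training set (here P 100 plus the confident completed trees) is an assumption.

```python
from typing import Callable, List, Sequence, TypeVar

Tree = TypeVar("Tree")        # a full dependency tree
Partial = TypeVar("Partial")  # a sentence with a partial set of dependencies
Model = TypeVar("Model")      # parser parameters (theta)


def density_driven_training(
    p100: List[Tree],
    relaxed_stages: Sequence[List[Partial]],
    train: Callable[[List[Tree]], Model],
    cdecode: Callable[[List[Partial], Model], List[Tree]],
    top: Callable[[List[Tree], Model], List[Tree]],
) -> List[Model]:
    """Return the sequence of models theta_1, theta_2, ... in training order."""
    models = [train(p100)]  # theta_1: trained on full trees only
    for partial in relaxed_stages:  # e.g. P80 ∪ P>=7, then looser span sets
        current = models[-1]
        completed = cdecode(partial, current)  # fill in missing dependencies
        confident = top(completed, current)    # keep the most confident trees
        models.append(train(p100 + confident))  # retrain on the enlarged set
    return models


if __name__ == "__main__":
    # Toy stand-ins so the control flow can be run end to end.
    toy_models = density_driven_training(
        p100=["a", "b"],
        relaxed_stages=[["c", "d", "e"], ["f"], ["g", "h"]],
        train=lambda trees: len(trees),
        cdecode=lambda partial, model: [p.upper() for p in partial],
        top=lambda trees, model: trees[:2],
    )
    print(toy_models)
```

Passing the three functions as parameters keeps the sketch independent of any particular parser; with three relaxed stages it yields four models, matching the θ 1 . . . θ 4 sequence in the text.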

Generalization to Multiple Languages
We now consider the generalization to learning from multiple languages. We again assume that the task is to learn a parser in a single target language, for example German. We assume that we now have multiple source languages. For example, in our experiments with German as the target, we used English, French, Spanish, Portuguese, Swedish, and Italian as source languages. We assume that we have fully supervised parsers for all source languages. We will consider two methods for combining information from the different languages:
Method 1: Concatenation. In this approach, we form sets P, P 100 , P 80 , P ≥7 etc. from each of the languages separately, and then concatenate the data to give new definitions of P, P 100 , P 80 , P ≥7 etc.
Method 2: Voting. In this case, we assume that each target language sentence is aligned to a source language sentence in each of the source languages. This is the case, for example, in the Europarl data, where we have translations of the same material into multiple languages. We can then create the set P of projected dependencies using a voting scheme. For any word (k, j) seen in the target language, each source language will identify a headword (this headword may be NULL if there is no alignment giving a dependency). We simply take the most frequent headword chosen by the languages. After creating the set P, we can create subsets such as P 100 , P 80 , P ≥7 in exactly the same way as before.
Once the various projected dependency training sets have been created, we train the dependency parsing model using the algorithm given in §3.4.
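As an illustration of the voting method, the following Python sketch picks a head for each target word from the heads proposed by several source languages. The function name and input layout are hypothetical. The text does not say whether NULL proposals count as votes or how ties are broken, so both are handled here by assumption: NULL votes are ignored unless `count_null` is set, and ties go to the head proposed first in source-language order.

```python
from collections import Counter
from typing import List, Optional


def vote_heads(
    proposals: List[List[Optional[int]]],
    count_null: bool = False,
) -> List[Optional[int]]:
    """Choose one head per target word by majority vote.

    proposals[s][j] is the head that source language s projects for the
    (j+1)'th target word: 0 for the root, None for NULL (no projection).
    """
    voted: List[Optional[int]] = []
    for votes in zip(*proposals):  # one tuple of votes per target word
        counts = Counter(v for v in votes if count_null or v is not None)
        # most_common keeps first-seen order among equal counts, so ties
        # fall to the earliest source language.
        voted.append(counts.most_common(1)[0][0] if counts else None)
    return voted


if __name__ == "__main__":
    proposals = [
        [2, 0, 2],        # heads projected from the first source language
        [2, 0, None],     # second source language
        [3, 0, None],     # third source language
    ]
    print(vote_heads(proposals))
    print(vote_heads(proposals, count_null=True))
```

The two calls differ only on the third word: ignoring NULL votes keeps the single proposed head, while counting them leaves the word without a head.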

Experiments
We now describe experiments using our approach. We first describe data and tools used in the experiments, and then describe results.

Data and Tools
Data. We use the Europarl data (Koehn, 2005) as our parallel data and the Google universal treebank (v2; standard data) as our evaluation data, and as our training data for the supervised source-language parsers. We use seven languages that are present in both Europarl and the Google universal treebank: English (used only as the source language), and German, Spanish, French, Italian, Portuguese and Swedish.
Word Alignments. We use GIZA++ (Och and Ney, 2000) to induce word alignments. Sentences with length greater than 100 and single-word sentences are removed from the parallel data. We follow common practice in training GIZA++ for both translation directions, and taking the intersection of the two sets as our final alignment. The GIZA++ default alignment model is used in all of our experiments.
Table 2: Parsing accuracies of different methods on the test data using the gold standard POS tags. The models θ 1 . . . θ 4 are described in §3.4. "en→trgt" is the single-source setting with English as the source language. "concat→trgt" and "voting→trgt" are results with multiple source languages for the concatenation and voting methods.
The Parsing Model. For all parsing experiments we use the Yara parser 7 (Rasooli and Tetreault, 2015), a reimplementation of the k-beam arc-eager parser of Zhang and Nivre (2011). We use a beam size of 64, and Brown clustering features 8 (Brown et al., 1992; Liang, 2005). The parser gives performance close to the state of the art: for example on section 23 of the Penn WSJ treebank (Marcus et al., 1993), it achieves 93.32% accuracy, compared to 92.9% accuracy for the parser of (Zhang and Nivre, 2011).

POS Consistency
As mentioned in §3.2, we define a soft POS consistency constraint to prune some projected dependencies. A source/target language word pair satisfies this constraint if one of the following conditions holds: 1) the POS tags for the two words are identical; 2) the word forms for the two words are identical (this occurs frequently for numbers, for example); 3) both tags are in one of the following equivalence classes:
These rules were developed primarily on German, with some additional validation on Spanish. These rules required a small amount of human engineering, but we view this as relatively negligible.
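The three conditions of the soft constraint, and the hard constraint from §3.2, can be expressed as a small predicate. The Python sketch below is illustrative only: the function name is hypothetical, and because the list of equivalence classes is not reproduced in this text, `EXAMPLE_CLASSES` contains placeholder classes chosen purely so that the example runs.

```python
from typing import FrozenSet, Sequence

# Placeholder classes for illustration only; these are NOT the paper's list.
EXAMPLE_CLASSES: Sequence[FrozenSet[str]] = (
    frozenset({"ADJ", "ADV"}),
    frozenset({"DET", "PRON"}),
)


def pos_consistent(
    src_word: str,
    src_tag: str,
    tgt_word: str,
    tgt_tag: str,
    classes: Sequence[FrozenSet[str]] = EXAMPLE_CLASSES,
    hard: bool = False,
) -> bool:
    """Return True if a source/target word pair passes the POS constraint."""
    if src_tag == tgt_tag:  # condition 1: identical tags
        return True
    if hard:  # the hard constraint accepts nothing else
        return False
    if src_word == tgt_word:  # condition 2: identical word forms
        return True
    # Condition 3: both tags fall into the same equivalence class.
    return any(src_tag in c and tgt_tag in c for c in classes)


if __name__ == "__main__":
    print(pos_consistent("2014", "NUM", "2014", "X"))
    print(pos_consistent("quick", "ADJ", "schnell", "ADV"))
    print(pos_consistent("quick", "ADJ", "schnell", "ADV", hard=True))
```

A projected dependency would be kept only if this predicate holds for both its head pair and its modifier pair.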
Parameter Tuning. We used German as a target language in the development of our approach, and in setting hyper-parameters. The parser is trained using the averaged structured perceptron algorithm (Collins, 2002) with max-violation updates (Huang et al., 2012). The number of iterations over the training data is 5 when training model θ 1 in any setting, and 2, 1 and 4 when training models θ 2 , θ 3 , θ 4 respectively. These values are chosen by observing the performance on German. We use θ 4 as the final output from the training process: this is found to be optimal in English to German projections.
7 https://github.com/yahoo/YaraParser
8 https://github.com/percyliang/brown-cluster

Results
This section gives results of our approach for the single source, multi-source (concatenation) and multi-source (voting) methods. Following previous work (Ma and Xia, 2014) we use gold-standard part-of-speech (POS) tags on test data. We also provide results with automatic POS tags.
Results with a Single Source Language. The first set of results is with a single source language; we use English as the source in all of these experiments. Table 2 shows the accuracy of parameters θ 1 . . . θ 4 for transfer into German, Spanish, French, Italian, Portuguese, and Swedish. Even the lowest performing model, θ 1 , which is trained only on full trees, has a performance of 75.88%, close to the 76.15% accuracy for the method of (Ma and Xia, 2014). There are clear gains as we move from θ 1 to θ 4 , on all languages. The average accuracy for θ 4 is 78.89%.

Results with Multiple Source Languages, using Concatenation
Table 2 shows results using multiple source languages, using the concatenation method. In these experiments for a given target language we use all other languages in our data as source languages. The performance of θ 1 improves from an average of 75.88% for a single source language, to 79.76% for multiple languages. The performance of θ 4 gives an additional improvement to 81.23%.
[Table caption, partially recovered: "... Table 4). All results are reported on gold part of speech tags. The numbers in parentheses are absolute improvements over (Ma and Xia, 2014). Sup (1st) is the supervised first-order dependency parser used by (Ma and Xia, 2014) and sup(ae) is the Yara arc-eager supervised parser (Rasooli and Tetreault, 2015)."]
Results with Multiple Source Languages, using Voting. The final set of results in Table 2 is for multiple languages using the voting strategy. There are further improvements: model θ 1 has average accuracy of 80.95%, and model θ 4 has average accuracy of 82.18%.
Results with Automatic POS Tags. We use our final θ 4 models to parse the treebank with automatic tags provided by the same POS tagger used for tagging the parallel data. Table 3 shows the results for the transfer methods and the supervised parsing models of (McDonald et al., 2011) and (Rasooli and Tetreault, 2015). The first-order supervised method of (McDonald et al., 2005) gives only a 1.7% average absolute improvement in accuracy over the voting method. For one language (Swedish), our method actually gives improved accuracy over the first-order parser.

Comparison to Previous Results
Table 4 gives a comparison of the accuracy on the six languages, using the single source and multiple source methods, to previous work. As shown in the table, our model outperforms all models: among them, the results of (McDonald et al., 2011) and (Ma and Xia, 2014) are directly comparable to ours because they use the same training and evaluation data. The recent work of (Xiao and Guo, 2015) uses the same parallel data but evaluates on CoNLL treebanks, and their results are lower than those of Ma and Xia (2014). Other recent work evaluates on the same data as ours but uses different parallel corpora; it reports results on only three languages (German: 60.35, Spanish: 71.90 and French: 72.93), all of which are far below our results. The work of (Grave and Elhadad, 2015) is the state-of-the-art fully unsupervised model with minimal linguistic prior knowledge. The model of (Zhang and Barzilay, 2015) does not use any parallel data but uses linguistic information across languages. Their semi-supervised model selectively samples 50 annotated sentences, but our model outperforms it.
Table 5: Statistics on projected dependencies for the target languages, for the single-source (en → trg), multi-source (concat) and multi-source (voting) methods, reported separately for P 80 ∪ P ≥7 and P 100 . "sen#" is the number of sentences. "dep#" is the average number of dependencies per sentence. "len" is the average sentence length. "acc." is the percentage of projected dependencies that agree with the output from a supervised parser.
Only part of the German (de) row is recoverable. en → trg, P 80 ∪ P ≥7 : sen# 34k, dep# 9.6, len 28.3, acc. 84.7. en → trg, P 100 : sen# 18k, len 6.8, acc. 85.8. concat, P 80 ∪ P ≥7 : sen# 98k, dep# 9.4.
Compared to the results of (McDonald et al., 2011) and (Ma and Xia, 2014) which are directly comparable, there are clear improvements across all languages; the highest accuracy, 82.18%, is a 5.51% absolute improvement over the average accuracy for (Ma and Xia, 2014).

Analysis
We conclude with some analysis of the accuracy of the projected dependencies for the different languages, for different definitions (P 100 , P 80 etc.), and for different projection methods. Table 5 gives a summary of statistics for the various languages. Recall that German is used as the development language in our experiments; the other languages can be considered to be test languages. In all cases the accuracy reported is the percentage match to a supervised parser used to parse the same data.
There are some clear trends. The accuracy of the P 100 datasets is high, with an average accuracy of 84.7% for the single source method, 88.3% for the concatenation method, and 89.0% for the voting method. The voting method not only increases accuracy over the single source method, but also increases the number of sentences (from an average 17k to 77k) and the average number of dependencies per sentence (from 6.8 to 10.4).
The accuracy of the P 80 ∪ P ≥7 datasets is slightly lower, with around 83-87% accuracy for the single source, concatenation and voting methods. The voting method gives a significant increase in the number of sentences, from an average of 140k to 243k. The average sentence length for this data is around 28 words, considerably longer than the P 100 data; the addition of longer sentences is very likely beneficial to the model. For the voting method the average number of dependencies is 13.7, giving an average density of 50% on these sentences.
The accuracy for the different languages, in particular for the voting data, is surprisingly uniform, with a range of 85.8-91.4% for the P 100 data, and 81.3-87.4% for the P 80 ∪ P ≥7 data. The number of sentences for each language, the average length of those sentences, and average number of dependencies per sentence is also quite uniform, with the exception of German, which is a clear outlier. German has fewer sentences, and fewer dependencies per sentence: this may account for it having the lowest accuracy for our models. Future work should investigate why this is the case: one hypothesis is that German has quite different word order from the other languages (it is V2, and verb final), which may lead to a degradation in the quality of the alignments from GIZA++, or in the projection process.
Finally, Figure 3 shows some randomly selected examples from the P 100 data for Spanish, giving a qualitative feel for the data obtained using the voting method.

Conclusions
We have described a density-driven method for the induction of dependency parsers using parallel data and source-language parsers. The key ideas are a series of increasingly relaxed definitions of density, together with an iterative training procedure that makes use of these definitions. The method gives a significant gain over previous methods, with dependency accuracies approaching the level of fully supervised methods. Future work should consider application of the method to a broader set of languages, and application of the method to transfer of information other than dependency structures.
[Figure 3: Randomly selected examples from the P 100 data for Spanish, obtained using the voting method. Recoverable example sentence: "El informe presentado por la red abarca una serie de temas muy vasta ."]