Cross-lingual Dependency Transfer: What Matters? Assessing the Impact of Pre- and Post-processing



Introduction
Supervised learning techniques nowadays lie at the core of most Natural Language Processing (NLP) tools. Their use is however hindered by the scarcity of annotated data, which are only available for a restricted number of tasks, genres, domains, and languages. The supervision information that exists for well-resourced languages can however be transferred to under-resourced languages through the use of cross-lingual techniques. In this work, we focus on the transfer of syntactic dependency annotations.
Two main transfer strategies have been proposed in the literature: direct model transfer and annotation transfer. The first approach is mainly based on delexicalized parsing (Zeman and Resnik, 2008; McDonald et al., 2011), which assumes a common morpho-syntactic representation (e.g. PoS tags) between the source and target languages. It has been improved with the use of self-training, data selection, relexicalization and multi-source transfer (Naseem et al., 2010; Cohen et al., 2011; Søgaard, 2011). The second approach (annotation transfer) relies on parallel corpora to project, through alignment links, the dependencies automatically predicted for a resource-rich language onto a resource-poor language. This approach, pioneered by Hwa et al. (2005), requires various heuristic transformation rules to cope with the non-isomorphism between the source and target structures as well as with the noise in source annotations and in alignments. It has since enjoyed great popularity and has been improved by many works (see the overview in Section 2).
In spite of the simplicity of the annotation transfer principle, all these methods have several (hidden) parameters, such as the symmetrization heuristic or filtering thresholds, that make any direct comparison of their performance very hard. That is why, in this work, we aim at analyzing the impact of external factors used as pre- and post-processing steps and their significance in the whole transfer process. To this end, we propose to use the simple transfer strategy exploiting partially annotated data introduced in Lacroix et al. (2016) to systematically compare various design decisions.
The transfer strategy used in our experiments is explained in Section 2. We then explore and analyze different external factors: projected data filtering (Section 3.1), enhancement of the parsing strategy (Section 3.2) and multi-source transfer (Section 3.3). Finally, we compare the efficiency of dependency transfer and supervised parsing (Section 3.4) and analyze the performance achieved for the different kinds of labels (Section 3.5).

In this section (Section 2), we describe the two main steps of the transfer process considered in our experiments: the projection of the dependencies through word alignments from a source language to a target language, and the training method of Lacroix et al. (2016), which learns a transition-based parser from partial dependency trees. We also present the dataset used in our experiments and evaluate the proposed approach.

Dependency Projection
Most works on dependency transfer use a similar setting. They consider sentence-aligned bitexts for which an automatically parsed text in a resource-rich language is associated with its translation in a target language. Parallel sentences are then aligned in both directions and these alignments are merged with a symmetrization heuristic. To further guarantee their quality, alignments are generally filtered using various hand-crafted rules, for instance, to remove alignment links that associate words with different PoS tags (Rasooli and Collins, 2015) or sentences in which the number of alignment links is too low. Finally, dependencies are projected through alignment links.
Dependencies for which both the head and the dependent are each aligned to exactly one word (i.e. 1:1 alignment) can be readily transferred to the target language. Difficulties with the projection arise with many-to-many links and unaligned tokens. Some heuristics were proposed by Hwa et al. (2005), and reused by Tiedemann (2014), to deal with these multiple alignments. Nonetheless, to avoid this problem, several works have proposed ad hoc rules to complete the trees. For instance, Spreyer and Kuhn (2009) propose to attach unaligned tokens to a fake root in order to ignore the actions associated with these dependencies during learning; Li et al. (2014) choose to add all dependency possibilities of the unaligned tokens to preserve ambiguity during learning; Ma and Xia (2014) consider the potentially noisy dependencies predicted by a delexicalized parser. At the expense of adding potentially fake tokens or noisy dependencies, these heuristics make it possible to automatically label a corpus of full parse trees in the target language, on which a standard learning method can be used.
In this work, we consider another approach and choose to ignore unattached words as well as many-to-many alignments: we focus on the projection through 1:1 alignments that are, intuitively, the most reliable. More precisely, to prevent noisy dependencies from being transferred, we remove all multiple alignment links and, among the remaining links, we remove the ones that associate words with different PoS (following the soft rules proposed by Rasooli and Collins (2015)). After projection, we filter out the sentences in which fewer than 80% of the words receive a dependency, as they often result from bad quality alignments. At the end, we obtain an automatically annotated corpus for the target language that contains partial but accurate annotations. We will describe, in the following section, how a parser can be trained on such data.
In spite of its simplicity, this way of transferring dependencies has several (hidden) parameters, such as the symmetrization heuristic or the filtering threshold, that can have a large impact on the quality of the transferred parser. We will evaluate, in Section 3.1, the impact of these design decisions.
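The projection and filtering steps described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function name, the data layout (dictionaries indexed by token position, with 0 as an artificial root) and the strict PoS equality test are assumptions, the latter being a simplification of the soft rules of Rasooli and Collins (2015).

```python
from collections import Counter

def project_dependencies(src_heads, links, src_pos, tgt_pos, n_tgt, min_coverage=0.8):
    """Project source heads onto target tokens through 1:1 alignment links.

    src_heads: dict source dependent -> source head (0 is the artificial root)
    links: iterable of (source index, target index) pairs, indices start at 1
    src_pos, tgt_pos: dict token index -> PoS tag
    Returns a partial dict target dependent -> target head, or None when
    fewer than `min_coverage` of the n_tgt target tokens receive a head.
    """
    links = set(links)
    src_count = Counter(s for s, _ in links)
    tgt_count = Counter(t for _, t in links)
    # Keep 1:1 links only, then drop links joining words with different PoS tags
    # (strict equality here: an assumption standing in for the soft rules).
    one_to_one = {s: t for s, t in links
                  if src_count[s] == 1 and tgt_count[t] == 1
                  and src_pos[s] == tgt_pos[t]}
    one_to_one[0] = 0  # assumption: the artificial root maps to itself
    tgt_heads = {}
    for dep, head in src_heads.items():
        # A dependency is projected only if both of its ends are reliably aligned.
        if dep in one_to_one and head in one_to_one:
            tgt_heads[one_to_one[dep]] = one_to_one[head]
    if len(tgt_heads) < min_coverage * n_tgt:
        return None  # sentence filtered out: too few attached tokens
    return tgt_heads
```

Target tokens that do not appear in the returned dictionary simply carry no supervision, which is what makes the resulting annotation partial.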

Partial Transition-based Learning
We consider a transition-based dependency parser based on the arc-eager algorithm (Nivre, 2003): this parser builds a dependency tree incrementally by performing a sequence of actions. At each step of the parsing process, a classifier scores each possible action and the highest scoring one is applied.
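As an illustration of the parsing loop just described, the following sketch applies the four arc-eager actions greedily. The `score` callable is a hypothetical stand-in for the classifier, and the unlabeled, dictionary-based representation is an assumption made for brevity.

```python
def greedy_arc_eager(n_tokens, score):
    """Greedily parse tokens 1..n_tokens with the arc-eager transition system.

    score(action, stack, buffer, heads) -> float is a hypothetical classifier.
    Returns a dict dependent -> head (0 is the artificial root); tokens that
    never receive a head are absent from the dict.
    """
    stack, buffer, heads = [0], list(range(1, n_tokens + 1)), {}
    while buffer:
        s = stack[-1] if stack else None
        b = buffer[0]
        legal = ["SHIFT"]                       # always possible while the buffer is non-empty
        if s is not None:
            legal.append("RIGHT-ARC")           # s becomes the head of b
            if s != 0 and s not in heads:
                legal.append("LEFT-ARC")        # b becomes the head of s
            if s in heads:
                legal.append("REDUCE")          # s already has a head and can be popped
        action = max(legal, key=lambda a: score(a, stack, buffer, heads))
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "RIGHT-ARC":
            heads[b] = s
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":
            heads[s] = b
            stack.pop()
        else:                                   # REDUCE
            stack.pop()
    return heads
```

With a scorer that always prefers RIGHT-ARC, for instance, a three-token sentence is parsed as a chain in which each token is attached to its left neighbour.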
Training relies on the dynamic oracle of Goldberg and Nivre (2012): for each sentence, a parse tree is built incrementally; at each step, if the predicted action creates an erroneous dependency (or, equivalently, prevents the creation of a gold dependency), a weight vector is updated, according to the perceptron rule. The set of all 'correct' actions is built considering the (potentially wrong) predicted tree and the gold action is defined as the correct action with the highest model score.
It is crucial to note that the training algorithm is an error-correction learning procedure that solely relies on its capacity to detect when an action choice will result in an error: when no error is detected, the construction of the parse tree continues according to the model prediction. Consequently, this training procedure can also be used, as such, to train a dependency parser from partially annotated data: when no supervision information is available (no correct dependencies are known), all actions are considered as correct; in this case, the predicted action is necessarily equal to the correct action, the weight vector is not updated, and the training process goes on.
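The error-detection step can be sketched as below, under the assumption that the cost of an action is the number of known gold dependencies it makes unreachable, in the spirit of the arc-eager dynamic oracle of Goldberg and Nivre (2012). The sketch is unlabeled, does not check whether an action is legal in the current configuration, and expects a non-empty stack and buffer; all names are illustrative.

```python
def action_costs(stack, buffer, gold_heads):
    """Cost of each arc-eager action given a PARTIAL gold annotation.

    stack: list of token indices (top is last); buffer: list (front is first).
    gold_heads: dict dependent -> head containing only the known dependencies.
    Tokens without a known head never contribute to any cost.
    """
    s, b = stack[-1], buffer[0]

    def head_in(dep, tokens):
        # 1 if the known gold head of `dep` is among `tokens`, else 0.
        return int(gold_heads.get(dep) in tokens)

    def deps_in(head, tokens):
        # Number of tokens in `tokens` whose known gold head is `head`.
        return sum(1 for k in tokens if gold_heads.get(k) == head)

    return {
        # s is popped: it loses a head in the buffer (other than b) and its buffer dependents.
        "LEFT-ARC": head_in(s, buffer[1:]) + deps_in(s, buffer),
        # b is attached to s: it loses any other head and its dependents in the stack.
        "RIGHT-ARC": head_in(b, stack[:-1] + buffer[1:]) + deps_in(b, stack),
        # s is popped: it loses its dependents in the buffer.
        "REDUCE": deps_in(s, buffer),
        # b is pushed unattached: it loses its head and dependents in the stack.
        "SHIFT": head_in(b, stack) + deps_in(b, stack),
    }

def needs_update(predicted_action, costs):
    """A perceptron update is triggered only when the predicted action has a non-zero cost."""
    return costs[predicted_action] > 0
```

When `gold_heads` is empty, every cost is zero, so no action can trigger an update: this is the property that allows the same procedure to be run on partially annotated sentences.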

Dataset
All our experiments are carried out on six languages of the Universal Dependency Treebank Project v2.0 (UDT): German (de), English (en), Spanish (es), French (fr), Italian (it) and Swedish (sv). As parallel corpora, we consider a subset of the Europarl corpus that has exactly the same English sentences across languages, collecting 1,231,216 parallel sentences for the 6 language pairs. For the evaluation, the original splits (train/dev/test) of the UDT corpora are kept for training the source parsers and evaluating the target parsers.

Experiments
The parallel sentences are aligned in both directions with Giza++ (Och and Ney, 2003). These alignments are then merged with either the intersection or the grow-diag heuristic. For each language pair, the source side of the Europarl data is PoS-tagged and parsed using the transition-based version of the MateParser (Bohnet and Nivre, 2012) with a beam of 40, which was trained on the UDT corpus. These predicted annotations are then partially projected onto the target language data using the projection strategy described in Section 2.1.
To train a parser on partially projected target data, we used our own implementation of the arc-eager dependency parser, using the features described in Zhang and Nivre (2011). The greedy version of the parser is used in all but one of the experiments of Section 3, while beam search (with a beam size of 8 for both learning and parsing) is used to achieve the best performance of the proposed method (Section 2.5). The six languages considered are those present in both Europarl and UDT.
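For illustration, the intersection heuristic keeps only the links proposed by both directional alignments. The sketch below assumes each alignment is given as a list of index pairs; this representation is chosen for the example and is not the Giza++ output format, and the grow-diag heuristic is not shown.

```python
def intersect_alignments(src_to_tgt, tgt_to_src):
    """Symmetrize two directional word alignments by intersection.

    src_to_tgt: iterable of (source index, target index) links
    tgt_to_src: iterable of (target index, source index) links
    Returns the set of (source index, target index) links found in both directions.
    """
    forward = set(src_to_tgt)
    backward = {(s, t) for t, s in tgt_to_src}  # flip to (source, target) order
    return forward & backward
```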

Performance
We present the results of our transfer strategy in Table 1. The results are presented first for cross-lingual transfer from English and then for transfer from multiple source languages using a voting method. These scores are obtained using the most appropriate external factors: filtering out the projected sentences in which fewer than 80% of the words are attached, and a beam-search strategy for parsing the source and target data. The effects of these parameters on the transfer results are examined in Sections 3.1 and 3.2, respectively.
This method achieves results that are competitive with recent state-of-the-art methods (Ma and Xia, 2014; Rasooli and Collins, 2015) at a much cheaper computational cost, which allows us to run all the experiments required to compare the various design decisions. The results of Table 1 also show that using the grow-diag heuristic rather than the intersection heuristic to symmetrize the alignments hurts performance for all languages.

Analysis

The importance of filtering
To assess the usefulness of filtering on transfer performance, we conduct experiments on several language pairs, considering two symmetrization heuristics: the intersection and the grow-diag heuristics. After the projection step, several greedy parsers are learned for increasing sizes of projected datasets. The sentences are included in the learning set in order of decreasing percentage of attached tokens.
The results for French, German and Swedish are presented in Figure 1. Similar curves are obtained for Italian and Spanish. These results show that adding partially annotated sentences improves parsing performance as long as these sentences have enough attached tokens. For instance, in French (focusing on the scores obtained with the intersection heuristic), a parser trained only on fully labelled sentences achieves a UAS of 75.6%; when sentences in which more than 80% of dependencies are known are added, parsing performance is improved to 76.9%, but adding more sentences hurts performance. Indeed, sentences with a small number of attached tokens correspond to sentence pairs with few alignment links, which are often not perfect translations of each other and may have very different grammatical structures.
One can notice that, while the number of sentences needed to reach the top scores varies greatly from one language to another, the average percentage of attached tokens per sentence stays within a narrow interval (from 74.9% (de) to 84.9% (sv)). Adding sparser data seems to bring more noise than relevant syntactic information. Controlling the quality of the projected data, rather than its quantity, is therefore a key point in the success of the transfer process. This observation justifies our decision to consider only sentences with more than 80% of attached tokens.
Finally, the scores obtained with the intersection symmetrization heuristic are mostly higher than those obtained with the grow-diag heuristic. It is worth noting that the number of sentences fully annotated with the grow-diag heuristic is far smaller than with the intersection. For instance, in French, the filtered training data contains 21,381 sentences when the intersection heuristic is used, but only 6,534 for the grow-diag heuristic. Indeed, the number of projected dependencies is lower because multiple alignments impede the projection of dependencies. This limits the projection of potentially wrong dependencies from ambiguous alignments, but it also limits the diversity of the syntactic information transferred to the target language, and hence the parsing performance. In the rest of the paper we will only consider the intersection heuristic. This experiment shows that an appropriate selection of the alignment strategy, and thus of the projected data used for learning, could benefit the transfer scores of the methods that exploit (raw or even completed) partial data.
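The construction of these increasingly large learning sets can be sketched as follows. The sentence representation (a pair made of the sentence length and its partial head dictionary) and the function names are assumptions made for the example.

```python
def coverage(n_tokens, partial_heads):
    """Fraction of the tokens of a sentence that received a projected head."""
    return len(partial_heads) / n_tokens

def nested_training_sets(sentences, sizes):
    """Build training sets of increasing size, best-covered sentences first.

    sentences: list of (n_tokens, partial_heads) pairs, with n_tokens > 0
    sizes: increasing list of training set sizes
    """
    ranked = sorted(sentences, key=lambda sent: coverage(*sent), reverse=True)
    return [ranked[:k] for k in sizes]
```

Each training set contains the previous one, so that any change in score between two consecutive sets can be attributed to the less densely annotated sentences that were added.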

Pumping the parsing
It is well known that different techniques can boost parsing performance. For instance, clusters (Koo et al., 2008) may be used to reduce lexical sparseness, which is particularly appropriate in the case of dependency parser transfer since parallel data are generally not from the same domain as the corpus used to train and evaluate the parser. Another approach for boosting parsing performance is the use of a beam-search strategy that reduces the number of search errors (Zhang and Nivre, 2012). In this section, we aim at assessing, first, how parsing performance of the source language impacts the quality of the transferred parser, and second, how using more 'advanced' parsing techniques may boost parsing in the target language.
Using a transfer process similar to the one of the previous section, we conduct experiments in which the source and target parsers are progressively enriched: we consider, in a first experiment, a greedy parser to predict the dependencies of the source and target sentences; the source greedy parsers are then replaced by a beam-search parser, and features describing Brown clusters learned from the Europarl data are added. Finally, we also consider a beam-search parser for parsing the target language, using our own implementation of the transition-based parser presented in Section 2.2 with a beam size of 8, and enrich it with Brown clusters.
The transfer scores are presented in Table 2. First, these results show that the alignments are not good enough to reflect improvements in (source) parsing quality on the target data: the use of beam search for parsing the source language yields an average improvement of 0.2 UAS point on the target languages, while the source (English) performance is improved by 2.3. The use of clusters improves neither the average score nor the source parsing performance. However, the beam-search strategy is clearly useful for parsing the projected target languages: scores are, on average, 1.34 higher. The use of clusters brings no consistent gain: only German performance is improved (+0.3), while Italian and Swedish are both negatively impacted (−0.3 for both).
We have observed that the use of clusters is mostly useless in all cases and, globally, that boosting the source parsing performance has very little effect on the final transfer scores. However, not surprisingly, the use of beam search for parsing the target data is highly effective for boosting the transfer scores.

Multisource impact
We have seen in Section 3.1 that filtering the projected data is a key point to achieve good transfer scores, as adding too much data for learning reduces the parsing performance. However, for a similar percentage of attached tokens, the number of sentences kept for learning varies a lot depending on the target language. For instance, when transferring from English, the number of sentences in which more than 80% of the tokens receive a dependency reaches 52,554 for Swedish but only 15,191 for German; the parsing scores differ greatly as well: their UAS is, respectively, 81.9 and 73.8. Spanish, Italian and French achieve relatively close scores for a similar number of sentences (around 20K to 30K).
These observations show that there is a correlation between the number of dependencies transferred and the parsing performance: the more sentences survive filtering, the better the scores. A natural way to increase the number of sentences with a high number of dependencies is to transfer dependencies from different languages: good projections result from good word alignments, which depend on the source and target languages involved.
We conduct multi-lingual experiments, similar to the experiments from English, in which each language (among German, Spanish, French, Italian and Swedish) is considered as the source language for the other ones. We consider as parallel corpora a subset of the Europarl corpus (as detailed in Section 2.3). The sentences are filtered as previously after the dependencies have been transferred across alignment links. All reported results are achieved by a greedy parser considering gold PoS-tagged data.
The results of this multi-lingual experiment are presented in Figure 2, with, for every source language, the number of sentences that survived filtering. These results show that, for each target language, the best score is achieved for the source language with the largest training set (which is English for German and Swedish, French for Spanish and Italian, and Spanish for French). With the exception of Swedish, for a given target language, the UAS increases with the number of sentences, regardless of the source parsing performance. As expected, languages from the same family, such as the ones derived from Latin (Spanish, French and Italian), are beneficial for each other. This observation has already been reported several times (e.g. by McDonald et al. (2013)), but Figure 2 suggests that the increase in performance may mainly result from a good alignment between the source and the target languages. Overall, these results stress the fact that the alignment quality has a large impact on the transfer performance and should not be neglected.
As seen in the previous section, the source parsing performance does not appear to be the parameter with the greatest impact on transfer. The quality of the alignment, which depends on the choice of the source language, is far more crucial.

Transfer vs Supervised Parsing
Cross-lingual transfer strategies are mainly used with the aim of swiftly developing NLP tools for resource-poor languages without the need for annotated corpora that are expensive to build from scratch. Results presented in the previous sections show that parsers trained on transferred data are still outperformed by supervised parsers. It is however difficult to evaluate how prejudicial this loss is. That is why, to assess the usefulness of transfer methods, we propose to determine the amount of gold annotated sentences needed to achieve performance similar to the performance of transferred parsers.
In a first series of experiments, we compare the performance of a cross-lingual dependency parser to supervised parsers learned from increasing amounts of data (starting with 50 sentences). All the experiments are repeated over 10 runs to mitigate the impact of selecting labelled data randomly, using the greedy version of the parser. Scores are evaluated on gold PoS-tagged data. The results, presented in Figure 3, show that the amount of supervised data needed to achieve the transfer scores is around 250, 200 and 400 sentences for German, French and Swedish, respectively. This observation strongly questions the interest of cross-lingual transfer: only a very limited amount of annotated data is required to outperform a parser trained on transferred annotations.
Previous work has shown that the performance difference between supervised and transferred PoS taggers partially results from divergences in annotation conventions and from evaluating the tagger on out-of-domain data: as in the setting described in Section 2, the taggers are trained on Europarl and evaluated on UDT. To assess the impact of these two elements on parsing performance, we propose, in a second series of experiments, to enrich the data labelled automatically by transferring annotations with an increasing amount of in-domain labelled data to learn new parsers.
Results of these experiments are presented in Figure 3. They show that a small amount of supervised data (300 sentences) usefully complements the projected data for parsing target languages: a parser trained on the combination of transferred and labelled data outperforms both a parser trained on the labelled data only and a parser trained only on the transferred data. However, above a specific threshold, projected data become useless and, worse, adding them to labelled data hurts parsing performance. This could mean that the projected data do not just lack syntactic diversity but also contain a substantial amount of projection errors, even if the alignments have been filtered with very conservative rules.

Label scores and frequencies
The previous experiments suggest that cross-lingual parsers suffer from alignment errors (or from their absence) even if the alignments are filtered (restricted to 1:1 and PoS-coherent links). To reveal systematic syntactic errors, we propose to examine the transfer scores depending on the (gold) syntactic label of the dependency. Our hypothesis is that the UAS of a given target label depends both on the capacity of the source parser to predict this label and on the ability of the transfer method to project the syntactic information. For the sake of clarity, we only report results for English to French transfer. Table 3 shows the frequencies of the labels predicted by a supervised parser on the English and French Europarl corpora as well as the frequencies of these labels on the French projected (from English) and filtered data. It appears that the frequencies of the labels tend to resemble the source frequencies, introducing a systematic bias in the data used to train a cross-lingual parser. In particular, some dependencies, such as the root, are over-projected, but no label is entirely skipped. The syntactic information is transferred quite proportionally. Table 3 also reports the supervised UAS of these labels for English and French, as well as the UAS of the transferred parser for French. We observe that the prediction of each label suffers from transfer: each score is generally lower than the (source or target) supervised score. One can also notice that over-projection does not benefit label scores. Overall, this may suggest that projection errors are also spread quite proportionally among the various labels.
Results of Table 3 also suggest that the capacity of the source parser to predict a given kind of dependency does not have much impact on the performance of the target parser. For instance, ADPOBJ and DET are both very well predicted by the English parser, but the predictions of the cross-lingual French parser are far better for DET than for ADPOBJ. Similar scores and frequencies are observed for different pairs of source-target languages. In addition, it is worth noting that we observe comparable behaviour when scores are computed depending on the PoS of the tokens. Frequencies are relatively well preserved and the loss in UAS is shared among the PoS tags.
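The per-label analysis can be sketched as follows: the UAS of a label is computed over the tokens that bear this gold label, and label frequencies are relative counts. The token-level input format and the function names are assumptions made for the example.

```python
from collections import Counter, defaultdict

def label_frequencies(labels):
    """Relative frequency of each dependency label in a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def uas_by_label(gold, predicted_heads):
    """UAS restricted to the tokens bearing each gold label.

    gold: list of (gold head, gold label) pairs, one per token
    predicted_heads: list of predicted heads, aligned with `gold`
    """
    correct, total = defaultdict(int), defaultdict(int)
    for (gold_head, label), pred_head in zip(gold, predicted_heads):
        total[label] += 1
        correct[label] += int(gold_head == pred_head)
    return {label: correct[label] / total[label] for label in total}
```

Grouping tokens by their PoS tag instead of their gold label gives the PoS-based variant of the analysis mentioned above.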

Conclusion
We have proposed to apply a simple method that learns transferred dependency parsers from partially projected data with the aim of analyzing the various parameters that impact the parsing performance. Our observations are valid for many methods of dependency transfer that operate on annotations (partially) projected via alignment links.
We have shown that the selection of the alignments (thus the dependencies) and the filtering of the projected data are crucial. The quantity of projected data used for learning is not relevant if quality is not controlled. However, the quantity of training data is correlated with the parsing performance, as quantity is rather a consequence of the quality of the alignments. Finally, the quality of alignment greatly depends on the relation shared between the source and target languages. It appears that all these choices are far more important than the quality of the source parsing.
Moreover, we have seen that the performance of transfer techniques still lags behind that of fully supervised learning. Our experiments suggest that many attachment errors are produced during the dependency projection and that these errors are spread over all kinds of syntactic phenomena. They most likely derive from alignment errors and from variation in the annotation scheme between languages. The recent development of more coherent annotation schemes and corpora (universal dependencies (Nivre et al., 2015)) tends to alleviate these problems, but there is still work to be done concerning the quality of the alignments. The main difficulty is to preserve enough sentences for learning while preventing the projection of erroneous dependencies.