How to Parse Low-Resource Languages: Cross-Lingual Parsing, Target Language Annotation, or Both?

To develop a parser for a language with no syntactically annotated data, we either have to develop a (small) treebank for the target language or rely on cross-lingual learning or projection, or possibly use some combination of these methods. In this paper, we compare the usefulness of cross-lingual model transfer and target language annotation for three different languages, with varying support from closely related high-resource languages. The results show that annotating even a very small amount of data in the target language is superior to any cross-lingual setup and that accuracy can be further improved by adding training data from related languages in a multilingual model.


Introduction
Despite significant advances in natural language processing over several decades, even basic technologies like part-of-speech tagging and syntactic parsing are still available only for a tiny fraction of the languages of the world. This observation has led to an increasing interest in techniques for supporting low-resource languages, typically by making use of data from high-resource languages together with methods for cross-lingual learning or transfer. These techniques include annotation projection (Hwa et al., 2002), model transfer (Zeman and Resnik, 2008; McDonald et al., 2011), treebank translation (Tiedemann et al., 2014), and multilingual parsing models (Duong et al., 2015a; Ammar et al., 2016). Despite the undeniable progress in this line of research, the question always looms large whether it is not more effective to simply annotate a small amount of training data in the target language of interest. Daniel Zeman, one of the inventors of delexicalized transfer parsing, maintains that you can get over 50% accuracy for many languages with just 100 annotated sentences, citing as evidence the results of Ramasamy (2014) for some Indian languages. Further support comes from the study of , who compares cross-lingual parsing to target language annotation in the context of building a treebank for Galician.
In this paper, we approach this question by comparing three ways of training dependency parsers for low-resource languages: monolingual models trained on small amounts of target language data; cross-lingual models trained only on data from related support languages; and multilingual models trained on both support and target language data. We perform experiments on three target languages with varying support from related high-resource languages: Faroese (supported by Danish, Norwegian, and Swedish), Upper Sorbian (supported by Czech, Polish, and Slovak), and North Saami (supported by Estonian, Finnish, and Hungarian). Our results show that monolingual models consistently outperform cross-lingual models even with very limited amounts of training data. In addition, there is always a multilingual model that outperforms the best monolingual model. Taken together, these results suggest that the most effective strategy for low-resource parser development may well be to annotate as much data as you can afford in the target language and then add training data from related languages if available.

Methodology
To be able to compare monolingual, cross-lingual and multilingual models, we adopt the multilingual parsing approach pioneered by Ammar et al. (2016) and deployed on a large scale by Smith et al. (2018a) in the 2018 CoNLL shared task on universal dependency parsing (Zeman et al., 2018). This approach differs from early work on model transfer, which relied on delexicalized models with part-of-speech tags as pivot features (Zeman and Resnik, 2008; McDonald et al., 2011). Although these models initially gave encouraging results, especially for closely related languages, the results were mostly based on experiments with gold part-of-speech tags, severely overestimating the accuracy achievable under more realistic conditions (Tiedemann, 2015). We instead use lexicalized models, which do not presuppose part-of-speech tagging or any other preprocessing except tokenization for the target language, and which rely on word, character and language embeddings. Besides being more realistic in a low-resource setting, this is justified by the reduced importance of part-of-speech tagging for neural dependency parsers (Dozat et al., 2017; Ma et al., 2018; Smith et al., 2018b).

Languages and Treebanks
From Universal Dependencies v2.3 (Nivre et al., 2016), we select three language clusters with one low-resource language and three related support languages with larger treebanks: a Scandinavian cluster with Faroese supported by Danish, Norwegian (Nynorsk) and Swedish; a West Slavic cluster with Upper Sorbian supported by Czech, Polish and Slovak; and a Uralic cluster with North Saami supported by Estonian, Finnish and Hungarian. It is worth noting that the support languages are much more closely related to the target language in the Scandinavian and West Slavic clusters than in the Uralic cluster. Table 1 lists the treebanks used for each language and the number of tokens in each data set.

Parser
We use UUParser v2.3 (de Lhoneux et al., 2017a; Smith et al., 2018a), which is an adaptation of the transition-based parser of Kiperwasser and Goldberg (2016) specifically for multilingual models. The original parsing architecture relies on a BiLSTM to learn representations of tokens in context and a multilayer perceptron to predict transitions and arc labels based on a few BiLSTM vectors. The multilingually motivated extensions in UUParser include an extended transition system for handling non-projective structures (de Lhoneux et al., 2017b) and a richer representation of input tokens. More specifically, each input token $w_i$ in language $l$ is represented by:

$$x_i = e(w_i) \circ \mathrm{BiLSTM}(ch_{1:m}) \circ e(t)$$

Here $x_i$ is the concatenation of a word embedding $e(w_i)$, a character-based vector $\mathrm{BiLSTM}(ch_{1:m})$ obtained by running a BiLSTM over the characters $ch_{1:m}$ of $w_i$, and a treebank embedding $e(t)$ representing the treebank $t$ that the input comes from. The treebank embedding is used to distinguish data from different languages as well as different treebanks from the same language (Stymne et al., 2018; Smith et al., 2018a). We drop the treebank embedding when training models on a single treebank and otherwise train all models with default settings and no pre-trained embeddings. For cross-lingual and multilingual models, word, character and language embeddings are thus learned jointly for all languages.

Table 2: Test set accuracy for target languages (UAS, LAS). −Target = cross-lingual models trained without target language data. +Target = models trained on target language data; monolingual (first row) and multilingual.

Experimental Setup
Within each cluster, we train cross-lingual models on data from every combination of one, two or three support languages (7 models), multilingual models on the same data sets plus target language data (7 models), and a monolingual model only on target language data, for a total of 15 models. For the support languages, we only use the dedicated training sets from Universal Dependencies v2.3. We do not standardize training set sizes, since the parser has been shown to be robust to size differences when training multi-treebank models (Stymne et al., 2018), but we limit the size of the Czech training set (which is about four times bigger than any other) to 300k tokens. For the target languages, we need a training set for the mono- and multilingual models, a development set to tune hyper-parameters, and a test set for the final evaluation. For Faroese and Upper Sorbian, there are only about 10k tokens of data, which we subdivide into 50% training, 25% development, and 25% test. For North Saami, there is more data, so we leave the dedicated test set of 10k tokens intact and extract a development set of 2.5k tokens from the training set, leaving 14.3k words for training.
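The count of 15 models per cluster follows from enumerating every non-empty subset of the three support languages. The sketch below illustrates only that enumeration; the function name `model_inventory` and the dictionary layout are assumptions made for illustration and are not part of UUParser or of the original experiments.

```python
from itertools import combinations


def model_inventory(target, support):
    """Enumerate the training-data combinations for one language cluster.

    Hypothetical helper: returns, for each model type, the list of
    languages whose data would be used to train each model.
    """
    # Every non-empty subset of the support languages (2^3 - 1 = 7 for three).
    subsets = [
        list(combo)
        for size in range(1, len(support) + 1)
        for combo in combinations(support, size)
    ]
    return {
        # Support-language data only.
        "cross-lingual": subsets,
        # The same support data plus target-language data.
        "multilingual": [subset + [target] for subset in subsets],
        # Target-language data only.
        "monolingual": [[target]],
    }


inventory = model_inventory("Faroese", ["Danish", "Norwegian", "Swedish"])
for kind, models in inventory.items():
    print(kind, len(models))
```

Summing the three list lengths gives the total number of models trained per cluster.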
The development sets for target languages are used for model selection as follows:
• For support languages with two treebanks of roughly equal size, we run preliminary experiments with cross-lingual models to decide whether to use both or only one. The resulting selection can be seen in Table 1.
• To improve compatibility of character-based representations across (support and target) languages, we try mapping characters that exist only in a target language to characters that exist in one or more support languages. This is helpful only for Faroese, where we map {ÍÚýúðí} to {IUyudi}.
• For cross-lingual models, the parser does not learn a language embedding for the target language, so we select a support language to use as proxy during parsing based on LAS on the target language development set.
• All models are trained for 30 epochs, and the best epoch is selected according to LAS on the target language development set.
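The Faroese character mapping listed above can be expressed as a simple translation table. This is a minimal sketch, assuming the two sets are paired position by position (Í→I, Ú→U, ý→y, ú→u, ð→d, í→i) and that the input may need Unicode NFC normalization so that accented letters are single code points; the names `FAROESE_TO_SUPPORT` and `map_characters` are hypothetical and do not come from the parser.

```python
import unicodedata

# Assumed pairing of the two character sets given in the text.
FAROESE_TO_SUPPORT = str.maketrans("ÍÚýúðí", "IUyudi")


def map_characters(text, table=FAROESE_TO_SUPPORT):
    """Replace target-only characters with support-language characters."""
    # Normalize first so that decomposed sequences (base letter plus
    # combining accent) become the precomposed characters in the table.
    normalized = unicodedata.normalize("NFC", text)
    return normalized.translate(table)


print(map_characters("við"))
```

Characters that are not in the table are left unchanged, so the mapping only affects the six listed characters.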
Finally, the development sets are also used in learning curve experiments (see Section 3).

Results
Table 2 reports results on the test sets for all cross-lingual, multilingual and monolingual models on our three target languages, in terms of both labeled attachment score (LAS) and unlabeled attachment score (UAS). The first thing to note is that the monolingual models, trained only on about 5k tokens for Faroese and Upper Sorbian and 14k tokens for North Saami, consistently outperform all cross-lingual models by a wide margin. The difference is especially large for the Uralic cluster, where the target language is in a different branch of the language family from all support languages, and where the best cross-lingual model does not even reach 10% for LAS (25% for UAS). But even for the Scandinavian and West Slavic clusters, where languages are more closely related and the best cross-lingual models get LAS over 40% and UAS over 50%, the monolingual model gives a higher LAS score by at least 15 points absolute (12 points absolute for UAS). This indicates that annotating a relatively small amount of training data in the target language is generally superior to using cross-lingual model transfer.
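For reference, the two scores can be computed per token as sketched below: UAS counts tokens whose predicted head is correct, and LAS counts tokens whose head and dependency label are both correct. This is a simplified illustration; the function name `attachment_scores` and the (head, label) input format are assumptions, and official evaluation scripts may differ in details such as how label subtypes or tokenization mismatches are handled.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages.

    gold, predicted: equal-length lists with one (head_index, label)
    pair per token (hypothetical input format for this sketch).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty token list")
    # UAS: only the head has to match.
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: both the head and the label have to match.
    correct_both = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_both / total


# Toy four-token sentence: one wrong label, one wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
predicted = [(2, "nsubj"), (0, "root"), (2, "iobj"), (3, "punct")]
uas, las = attachment_scores(gold, predicted)
print(f"UAS = {uas:.1f}, LAS = {las:.1f}")
```

Because a correct label only counts when the head is also correct, LAS can never exceed UAS for the same output.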
The second main trend is that, despite the poor results for cross-lingual models, the best multilingual models consistently outperform the monolingual models. For the Scandinavian and West Slavic clusters, all multilingual models outperform the monolingual model and the best model improves by as much as 5.9/6.9 LAS and 5.3/6.5 UAS. But even for the Uralic cluster, where data from the related support languages seem completely useless in the cross-lingual scenario, using the same data in a multilingual model improves on the monolingual model in 3 out of 7 cases for UAS (2 out of 7 for LAS). The relative improvement for the best multilingual model is smaller than in the other two clusters, but it should be kept in mind that the target language training set is almost three times bigger for North Saami than for Faroese and Upper Sorbian. These results suggest that, even if target language annotation is more effective than cross-lingual transfer, adding data from related support languages can nevertheless lead to further improvements.
To understand why multilingual models work so much better than cross-lingual models, it is important to note that the former learn word, character and language embeddings for the target language and that these embeddings are learned together for all languages. The cross-lingual models have no target language specific representations and have to rely on a proxy language embedding and the existence of cognates for matching word and character representations. This works especially poorly for the Uralic cluster, where the distance from the target to the support languages is much larger.
So how much target language data do we need to outperform a cross-lingual model? To answer this question, we run learning curve experiments for the monolingual and best multilingual models, using the development sets for evaluation, and gradually increasing the amount of target language training data from 0 to 50, 100, 500, 1k, 3k, 5k and 10k tokens (Figure 1). For the Scandinavian and Uralic clusters, we only need 1k tokens for the monolingual model to surpass the cross-lingual model with respect to LAS. For the West Slavic cluster, results are slightly erratic for the smallest training sets, but 3k tokens definitely suffice to reach the accuracy of the best cross-lingual model.[1] In all three cases, this is less than 200 sentences,[2] so the results seem to support Daniel Zeman's claim that something like 100 sentences can be sufficient to train a decent parser, although in our study it is only Faroese that reaches a (labeled) accuracy of 50% with only 100 sentences.
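Reading the crossover point off a learning curve amounts to finding the smallest training size whose score exceeds the cross-lingual baseline. The sketch below shows that lookup; the function name `first_size_surpassing` is hypothetical, and the numbers in the example are made-up placeholders chosen only to exercise the code, not values from Figure 1 or Table 2.

```python
def first_size_surpassing(curve, baseline):
    """Return the smallest training size whose score exceeds the baseline.

    curve: dict mapping training-set size in tokens to a score (e.g. LAS).
    Returns None if no point on the curve exceeds the baseline.
    """
    for size in sorted(curve):
        if curve[size] > baseline:
            return size
    return None


# Placeholder values for illustration only (NOT results from the paper).
example_curve = {50: 5.0, 100: 12.0, 500: 30.0, 1000: 45.0, 3000: 55.0}
example_baseline = 40.0
print(first_size_surpassing(example_curve, example_baseline))
```

For an erratic curve that rises above the baseline and then dips again, this returns the first crossing only, so a stricter criterion may be preferable in that case.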

Related Work
Work on cross-lingual learning for parsing and related tasks has focused on three main approaches: annotation projection (Hwa et al., 2002; Hwa et al., 2005; Tiedemann, 2014), model transfer (Zeman and Resnik, 2008; McDonald et al., 2011), and (to a lesser extent) treebank translation (Tiedemann et al., 2014). Annotation projection and treebank translation presuppose parallel data, so we will focus on model transfer, which is closest to our work. Model transfer was pioneered for closely related languages by Zeman and Resnik (2008), using delexicalized models and relying on a common part-of-speech tagset for the source and target language. The idea was refined and generalized to multi-source transfer by McDonald et al. (2011) and gained further momentum with the advent of cross-linguistically consistent syntactic annotation, which facilitated evaluation (McDonald et al., 2013). Other studies concerned methods for selecting optimal source languages (Søgaard and Wulff, 2012; Rosa and Žabokrtský, 2015). However, most of the early studies of model transfer relied on evaluation with gold part-of-speech tags on the target side, which was later shown to give over-optimistic results (Tiedemann, 2015).
[1] If we consider UAS instead of LAS, the patterns are very similar, with the curves crossing at around 1k tokens for the Scandinavian and Uralic clusters and just under 3k tokens for the West Slavic cluster, so we omit these figures to save space.
[2] Upper Sorbian has significantly longer sentences than the other two target languages.
A study of special relevance to our own work is that of , who specifically study the amount of target language training data needed to outperform cross-lingual model transfer in the context of building a UD treebank for Galician. Drawing on data from 7 other Romance language varieties (Brazilian Portuguese, Catalan, European Portuguese, French, Italian, Romanian and Spanish), they show that a single-source transfer parser achieves LAS corresponding to about 3,000 tokens of target language training data and UAS corresponding to about 7,000 tokens. However, they also show that careful combination and adaptation of source language data from multiple languages can increase these numbers to 16,000 (LAS) and 20,000 (UAS). One difference between their study and ours is that they make use of part-of-speech tags as pivot features, which may explain why especially the adapted multi-source transfer parsers seem to perform better than in our study. In addition, the similarity between Galician and some of the Romance languages is probably greater than in most of our support-target language pairs. Another difference is that they find that cross-lingual parsers are more competitive with respect to UAS than LAS, whereas we find that about the same number of target language training tokens is needed to reach cross-lingual performance with respect to both metrics. It is possible but by no means obvious that this difference is also related to the presence or absence of part-of-speech tags.
The increasing use of neural networks and distributed representations in syntactic parsing has led to more flexible models for cross-lingual and multilingual learning embeddings that go beyond delexicalized models and their reliance on part-of-speech tags (Duong et al., 2015a; Duong et al., 2015b; Guo et al., 2015a; Guo et al., 2015b). Especially important for our own work is the multilingual model of Ammar et al. (2016) with its use of language embeddings, which were later generalized to treebank embeddings that allow seamless integration of multiple languages as well as heterogeneous treebanks for a single language (de Lhoneux et al., 2017a; Stymne et al., 2018; Smith et al., 2018a). A more recent line of research involves the use of synthetic treebanks (Wang and Eisner, 2016; Wang and Eisner, 2018), an approach recently applied to parser development for one of our target languages, Faroese (Tyers et al., 2018). Finally, it is worth noting that the superiority of annotating target language data over using cross-lingual methods has also been demonstrated for the related part-of-speech tagging problem, in the context of historical text processing, by Schulz and Kuhn (2016) and Schulz and Ketschik (2019).

Conclusion
We have compared cross-lingual, multilingual and monolingual parser training for three low-resource languages, supported to different degrees by related languages with more resources. Our main conclusion is that training a monolingual model on target language data gives better performance than any cross-lingual model as soon as we have at least 200 annotated target language sentences. Moreover, adding data from related languages to train a multilingual model can improve performance further by up to 7 LAS points. In short, to develop a parser for a low-resource language, annotate as much data as you can afford and add data from related languages if available.