A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages

Parsers are available for only a handful of the world's languages, since they require lots of training data. How far can we get with just a small amount of training data? We systematically compare a set of simple strategies for improving low-resource parsers: data augmentation, which has not been tested before for dependency parsing; cross-lingual training; and transliteration. Experimenting on three typologically diverse low-resource languages (North Sámi, Galician, and Kazakh), we find that (1) when only the low-resource treebank is available, data augmentation is very helpful; (2) when a related high-resource treebank is available, cross-lingual training is helpful and complements data augmentation; and (3) when the high-resource treebank uses a different writing system, transliteration into a shared orthographic space is also very helpful.


Introduction
Large annotated treebanks are available for only a tiny fraction of the world's languages, and there is a wealth of literature on strategies for parsing with few resources (Hwa et al., 2005; Zeman and Resnik, 2008; McDonald et al., 2011; Søgaard, 2011). A popular approach is to train a parser on a related high-resource language and adapt it to the low-resource language. This approach benefits from the availability of Universal Dependencies (UD; Nivre et al., 2016), prompting substantial research (Tiedemann and Agić, 2016; Agić, 2017; Rosa and Mareček, 2018), along with the VarDial and the CoNLL UD shared tasks (Zampieri et al., 2017; Zeman et al., 2017, 2018).
But low-resource parsing is still difficult. The organizers of the CoNLL 2018 UD shared task (Zeman et al., 2018) report that, in general, results on the task's nine low-resource treebanks "are extremely low and the outputs are hardly useful for downstream applications." So if we want to build a parser in a language with few resources, what can we do? To answer this question, we systematically compare several practical strategies for low-resource parsing, asking:
1. What can we do with only a very small target treebank for a low-resource language?
2. What can we do if we also have a source treebank for a related high-resource language?
3. What if the source and target treebanks do not share a writing system?
Each of these scenarios requires different approaches. Data augmentation is applicable in all scenarios, and has proven useful for low-resource NLP in general (Fadaee et al., 2017; Bergmanis et al., 2017; Sahin and Steedman, 2018). Transfer learning via cross-lingual training is applicable in scenarios 2 and 3. Finally, transliteration may be useful in scenario 3.
To keep our scenarios as realistic as possible, we assume that no taggers are available, since this would entail substantial annotation. Therefore, our neural parsing models must learn to parse from words or characters (that is, they must be lexicalized), even though there may be little shared vocabulary between source and target treebanks. While this may intuitively seem to make cross-lingual training difficult, recent results have shown that lexical parameter sharing on characters and words can in fact improve cross-lingual parsing (de Lhoneux et al., 2018), and that in some circumstances, a lexicalized parser can outperform a delexicalized one, even in a low-resource setting (Falenska and Çetinoğlu, 2017).
We experiment on three language pairs from different language families, in which the first of each is a genuinely low-resource language: North Sámi and Finnish (Uralic); Galician and Portuguese (Romance); and Kazakh and Turkish (Turkic), which have different writing systems.¹ To avoid optimistic evaluation, we extensively experiment only with North Sámi, which we also analyse to understand why our cross-lingual training outperforms the other parsing strategies. We treat Galician and Kazakh as truly held-out, and test only our best methods on these languages. Our results show that:
1. When no source treebank is available, data augmentation is very helpful: dependency tree morphing improves labeled attachment score (LAS) by as much as 9.3%. Our analysis suggests that syntactic rather than lexical variation is most useful for data augmentation.
2. When a source treebank is available, cross-lingual parsing improves LAS by up to 16.2%, but data augmentation still helps, by an additional 2.6%. Our analysis suggests that improvements from cross-lingual parsing occur because the parser learns syntactic regularities about word order, since it does not have access to POS and has little reusable information about word forms.
3. If source and target treebanks have different writing systems, transliterating them to a common orthography is very effective.

Methods
We describe three techniques for improving low-resource parsing: (1) two data augmentation methods which have not been applied before for dependency parsing, (2) cross-lingual training, and (3) transliteration.

Data augmentation by dependency tree morphing (Morph)
Sahin and Steedman (2018) introduce two operations to augment a dataset for low-resource POS tagging. Their method assumes access to a dependency tree, but they do not test it for dependency parsing, which we do here for the first time. The first operation, cropping, removes some parts of a sentence to create a smaller sentence. The second operation, rotation, keeps all the words in the sentence but re-orders subtrees attached to the root verb, in particular those attached by NSUBJ (nominal subject), OBJ (direct object), IOBJ (indirect object), or OBL (oblique nominal) dependencies. Figure 1 illustrates both operations.
It is important to note that while both operations change the set of words or the word order, they do not change the dependencies. The sentences themselves may be awkward or ill-formed, but the corresponding analyses are still likely to be correct, and thus beneficial for learning. This is because they provide the model with more examples of variations in argument structure (cropping) and in constituent order (rotation), which may benefit languages with flexible word order and rich morphology. Some of our low-resource languages have these properties: while North Sámi has a fixed word order (SVO), Galician and Kazakh have relatively free word order. All three languages use case marking on nouns, so word order may not be as important for correct attachment.
Both rotation and cropping can produce many trees. We use the default parameters given in Sahin and Steedman (2018).
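The two morphing operations can be illustrated with a minimal sketch. This is not the implementation of Sahin and Steedman (2018): the tuple layout `(id, form, head, deprel)`, the function names, and the reading of cropping as "keep the root plus one argument subtree" are assumptions made for illustration, and the sketch assumes a single root.

```python
import random

# Dependency labels whose subtrees are reordered by rotation (from the text above).
FLEXIBLE = {"nsubj", "obj", "iobj", "obl"}

def subtree_ids(tokens, head_id):
    """Return the ids of head_id and all of its descendants."""
    ids = {head_id}
    changed = True
    while changed:
        changed = False
        for tid, _form, head, _rel in tokens:
            if head in ids and tid not in ids:
                ids.add(tid)
                changed = True
    return ids

def renumber(tokens):
    """Reassign ids 1..n in the new surface order and remap head indices."""
    new_id = {tok[0]: i for i, tok in enumerate(tokens, start=1)}
    new_id[0] = 0  # the artificial ROOT keeps index 0
    return [(new_id[tid], form, new_id[head], rel) for tid, form, head, rel in tokens]

def rotate(tokens, rng):
    """Shuffle root-attached argument subtrees among their own slots."""
    root_id = next(tok[0] for tok in tokens if tok[2] == 0)
    owner = {}  # token id -> id of the flexible dependent that heads its chunk
    for tid, _form, head, rel in tokens:
        if head == root_id and rel in FLEXIBLE:
            for d in subtree_ids(tokens, tid):
                owner[d] = tid
    chunks, seen = [], set()
    for tok in tokens:
        key = owner.get(tok[0])
        if key is None:
            chunks.append([tok])  # fixed single-token chunk (e.g. the root verb)
        elif key not in seen:
            seen.add(key)
            chunks.append([t for t in tokens if owner.get(t[0]) == key])
    slots = [i for i, chunk in enumerate(chunks) if chunk[0][0] in owner]
    movable = [chunks[i] for i in slots]
    rng.shuffle(movable)
    for i, chunk in zip(slots, movable):
        chunks[i] = chunk
    return renumber([tok for chunk in chunks for tok in chunk])

def crop(tokens, keep_rel):
    """Keep the root plus the root-attached subtrees labeled keep_rel."""
    root_id = next(tok[0] for tok in tokens if tok[2] == 0)
    keep = {root_id}
    for tid, _form, head, rel in tokens:
        if head == root_id and rel == keep_rel:
            keep |= subtree_ids(tokens, tid)
    return renumber([tok for tok in tokens if tok[0] in keep])

# (id, form, head, deprel) for "She wrote me a letter"
sentence = [(1, "She", 2, "nsubj"), (2, "wrote", 0, "root"), (3, "me", 2, "iobj"),
            (4, "a", 5, "det"), (5, "letter", 2, "obj")]
print(rotate(sentence, random.Random(0)))
print(crop(sentence, "obj"))
```

In both functions only the surface order or the set of words changes; every remaining word keeps its original head word and label, which is the property the paragraph above relies on.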

Data augmentation by nonce sentence generation (Nonce)
Our next data augmentation method is adapted from Gulordava et al. (2018). The main idea is to create nonce sentences by replacing some of the words with other words that have the same syntactic annotations. For each training sentence, we replace each content word (noun, verb, or adjective) with an alternative word having the same universal POS, morphological features, and dependency label.² Specifically, for each content word, we first stochastically choose whether to replace it; then, if we have chosen to replace it, we uniformly sample the replacement word type meeting the corresponding constraints. For instance, given a sentence "He borrowed a book from the library.", we can generate the following sentences:
(1) a. He bought a book from the shop .
    b. He wore a umbrella from the library .
This generation method is only based on syntactic features (i.e., morphology and dependency labels), so it sometimes produces nonsensical sentences like (1b). But since we only replace words if they have the same morphological features and dependency label, this method preserves the original tree structures in the treebank. Following Gulordava et al. (2018), we generate five nonce sentences for each original sentence.
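The replacement procedure described above can be sketched as follows. The token layout `(form, upos, feats, deprel)`, the function names, and the replacement probability `p_replace=0.5` are assumptions of this sketch; the text does not state the probability that was used. The candidate pool here includes the original word itself.

```python
import random
from collections import defaultdict

CONTENT_POS = {"NOUN", "VERB", "ADJ"}

def build_pool(treebank):
    """Group content word types by (universal POS, morphological features, dependency label)."""
    pool = defaultdict(set)
    for sent in treebank:
        for form, upos, feats, deprel in sent:
            if upos in CONTENT_POS:
                pool[(upos, feats, deprel)].add(form)
    return {key: sorted(forms) for key, forms in pool.items()}

def make_nonce(sent, pool, rng, p_replace=0.5):
    """Replace each content word with probability p_replace by a uniformly sampled type."""
    out = []
    for form, upos, feats, deprel in sent:
        if upos in CONTENT_POS and rng.random() < p_replace:
            form = rng.choice(pool[(upos, feats, deprel)])
        out.append((form, upos, feats, deprel))
    return out

def augment(treebank, rng, n_per_sentence=5):
    """Generate n_per_sentence nonce sentences for every original sentence."""
    pool = build_pool(treebank)
    return [make_nonce(sent, pool, rng) for sent in treebank for _ in range(n_per_sentence)]
```

Because only the word form is changed, the head indices and labels of the original tree can be copied to each nonce sentence unchanged.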

Cross-lingual training
When a source treebank is available, model transfer is a viable option. We perform model transfer by cross-lingual parser training: we first train on both source and target treebanks to produce a single model, and then fine-tune the model only on the target treebank. In our preliminary experiments (Appendix A), we found that fine-tuning on the target treebank was effective in all settings, so we use it in all applicable experiments reported in this paper.

Transliteration
Two related languages might not share a writing system even when they belong to the same family. We evaluate whether a simple transliteration would be helpful for cross-lingual training in this case. In our study, the Turkish treebank is written in extended Latin while the Kazakh treebank is written in Cyrillic. This difference potentially makes model transfer less useful, and means we might not be able to leverage lexical similarities between the two languages. We pre-process both treebanks by transliterating them to the same "pivot" alphabet, basic Latin.³
The mapping from Turkish is straightforward. Its alphabet consists of 29 letters, 23 of which are in basic Latin. The other six letters, 'ç', 'ğ', 'ı', 'ö', 'ş', and 'ü', add diacritics to basic Latin characters, facilitating different pronunciations.⁴ We map these to their basic Latin counterparts, e.g., 'ç' to 'c'. For Kazakh, we use a simple dictionary created by a Kazakh computational linguist to map each Cyrillic letter to the basic Latin alphabet.⁵

² The dependency label constraint is new to this paper.
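The Turkish side of the transliteration amounts to a character table, sketched below. The handling of uppercase letters is an assumption of this sketch. The Kazakh dictionary used in the paper is not reproduced here; `apply_mapping` only shows how such a dictionary could be applied, and the two-letter mapping in the usage line is a toy placeholder, not the actual table.

```python
# The six non-basic Turkish letters named in the text, plus assumed uppercase forms.
TURKISH_TO_BASIC_LATIN = str.maketrans({
    "ç": "c", "ğ": "g", "ı": "i", "ö": "o", "ş": "s", "ü": "u",
    "Ç": "C", "Ğ": "G", "İ": "I", "Ö": "O", "Ş": "S", "Ü": "U",
})

def transliterate_turkish(text):
    """Map Turkish letters outside basic Latin to their basic Latin counterparts."""
    return text.translate(TURKISH_TO_BASIC_LATIN)

def apply_mapping(text, mapping):
    """Apply a character dictionary (e.g. Cyrillic to Latin); unmapped characters pass through."""
    return "".join(mapping.get(ch, ch) for ch in text)

print(transliterate_turkish("çığ öşü"))
print(apply_mapping("ата", {"а": "a", "т": "t"}))  # toy mapping for illustration only
```

A dictionary-based function is used for the second case because a single Cyrillic letter may map to more than one Latin character.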

Dependency Parsing Model
We use the Uppsala parser, a transition-based neural dependency parser (de Lhoneux et al., 2017a,b; Kiperwasser and Goldberg, 2016). The parser uses an arc-hybrid transition system (Kuhlmann et al., 2011), extended with a static-dynamic oracle and SWAP transition to allow non-projective dependency trees (Nivre, 2009).
Let $w = w_0, \ldots, w_{|w|}$ be an input sentence of length $|w|$ and let $w_0$ represent an artificial ROOT token. We create a vector representation for each input token $w_i$ by concatenating (;) its word embedding, $e_w(w_i)$, and its character-based word embedding, $e_c(w_i)$:
$$x_i = [e_w(w_i); e_c(w_i)] \tag{1}$$
Here, $e_c(w_i)$ is the output of a character-level bidirectional LSTM (biLSTM) encoder run over the characters of $w_i$ (Ling et al., 2015); this makes the model fully open-vocabulary, since it can produce representations for any character sequence. We then obtain a context-sensitive encoding $h_i$ using a word-level biLSTM encoder:
$$h_i = \mathrm{biLSTM}(x_{0:|w|}, i) \tag{2}$$
We then create a configuration by concatenating the encoding of a fixed number of words on the top of the stack and the beginning of the buffer. Given this configuration, we predict a transition and its arc label using a multi-layer perceptron (MLP).
More details of the core parser can be found in de Lhoneux et al. (2017a,b).
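As a rough illustration of the transition system, the sketch below applies a given sequence of arc-hybrid transitions to build a tree. It omits the SWAP transition, the oracle, and the neural scoring, and it places ROOT (index 0) at the end of the buffer, which is a common arc-hybrid setup assumed here rather than something stated in the text. The function name and action encoding are hypothetical.

```python
def run_arc_hybrid(n_words, actions):
    """Apply (action, label) pairs; return the final stack, buffer, and (head, dep, label) arcs."""
    stack = []
    buffer = list(range(1, n_words + 1)) + [0]  # assumption: ROOT sits at the end of the buffer
    arcs = []
    for action, label in actions:
        if action == "shift":
            # Move the first buffer item onto the stack.
            stack.append(buffer.pop(0))
        elif action == "left":
            # Top of stack becomes a dependent of the first buffer item.
            dep = stack.pop()
            arcs.append((buffer[0], dep, label))
        elif action == "right":
            # Top of stack becomes a dependent of the item below it on the stack.
            dep = stack.pop()
            arcs.append((stack[-1], dep, label))
        else:
            raise ValueError(f"unknown action: {action}")
    return stack, buffer, arcs

# "She wrote me a letter" with words indexed 1..5
actions = [("shift", None), ("left", "nsubj"), ("shift", None), ("shift", None),
           ("right", "iobj"), ("shift", None), ("left", "det"), ("shift", None),
           ("right", "obj"), ("left", "root")]
print(run_arc_hybrid(5, actions))
```

In the parser described above, each action and its label would be predicted by the MLP from the current configuration instead of being supplied in advance.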

Parameter sharing
To train cross-lingual models, we use the strategy of de Lhoneux et al. (2018) for parameter sharing, which uses soft sharing for word and character parameters, and hard sharing for the MLP parameters. Soft parameter sharing uses a language embedding, which, in theory, learns what parameters to share between the two languages. Let $c_j$ be an embedding of character $c_j$ in a token $w_i$ from the treebank of language $k$, and let $l_k$ be the language embedding. For sharing on characters, we concatenate character and language embedding: $[c_j; l_k]$ for input to the character-level biLSTM.
Similarly, for input to the word-level biLSTM, we concatenate the language embedding to the word embedding, modifying Eq. 1 to
$$x_i = [e_w(w_i); e_c(w_i); l_k] \tag{3}$$
We use the default hyperparameters of de Lhoneux et al. (2018) in our experiments. We fine-tune each model by training it further only on the target treebank (Shi et al., 2016). We use early stopping based on labeled attachment score (LAS) on the development set.

Datasets
We use Universal Dependencies (UD) treebanks version 2.2 (Nivre et al., 2018). None of our target treebanks have a development set, so we generate new train/dev splits by 50:50 (Table 1). Having large development sets allows us to perform better analysis for this study. North Sámi is our largest low-resource treebank, so we use it for a full evaluation and analysis of different strategies before testing on the other languages. To understand the effect of target treebank size, we generate three datasets with different training sizes: $T_{10}$ (∼10%), $T_{50}$ (∼50%), and $T_{100}$ (100%). Table 2 reports the number of training sentences after we augment the data using methods described in Section 2. We apply MORPH and NONCE separately to understand the effect of each method and to control the amount of noise in the augmented data.
We employ two baselines: a monolingual model (§3.1) and a cross-lingual model (§2.3), both without data augmentation. The monolingual model acts as a simple baseline, resembling a situation in which no source treebank is available for the target treebank (i.e., no available treebanks from related languages). The cross-lingual model serves as a strong baseline, simulating a case when there is a source treebank. We compare both baselines to models trained with MORPH and NONCE augmentation methods. Table 3 reports our results, and we review our motivating scenarios below.
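The data preparation described above (a 50:50 train/dev split, then training subsets of roughly 10%, 50%, and 100%) can be sketched as follows. Whether the original splits were drawn at random is not stated in the text, so the random shuffling and sampling here are assumptions, as are the function names.

```python
import random

def split_train_dev(sentences, rng):
    """Split a treebank 50:50 into train and dev portions after shuffling."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def subsample(train, fraction, rng):
    """Draw a training subset of the given fraction (e.g. 0.1 for the smallest setting)."""
    k = max(1, round(len(train) * fraction))
    return rng.sample(train, k)

rng = random.Random(13)
train, dev = split_train_dev(range(200), rng)
subsets = {name: subsample(train, frac, rng)
           for name, frac in [("T10", 0.1), ("T50", 0.5), ("T100", 1.0)]}
print({name: len(sents) for name, sents in subsets.items()})
```

Augmentation would then be applied to each training subset only, leaving the development portion untouched.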

Parsing North Sámi
Scenario 1: we only have a very small target treebank. In the monolingual experiments, we observe that both dependency tree morphing (MORPH) and nonce sentence generation (NONCE) improve performance, indicating the strong benefits of data augmentation when there are no other resources available except the target treebank itself. In particular, when the amount of training data is the lowest ($T_{10}$), data augmentation improves performance by up to 9.3% LAS.
Scenario 2: a source treebank is available. We see that cross-lingual training (cross-base) performs better than monolingual models even with augmentation. For the $T_{10}$ setting, cross-base achieves almost twice the LAS of the monolingual baseline (mono-base). The benefits of data augmentation are less evident in the cross-lingual setting, but in the $T_{10}$ scenario, data augmentation still clearly helps. Overall, cross-lingual training combined with data augmentation yields the best result.
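For reference, the attachment scores used throughout can be computed as in the sketch below: UAS counts tokens whose predicted head is correct, and LAS additionally requires the correct label. The `(head, label)` token layout and the function name are assumptions of this sketch, and it ignores details of the official shared task evaluation such as tokenization alignment.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) in percent for parallel lists of (head, label) pairs."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, pred))
    correct_both = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * correct_heads / n, 100.0 * correct_both / n

gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.1f} LAS={las:.1f}")
```

In this toy case all heads are correct but one label is wrong, so LAS is lower than UAS.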

What is learned from Finnish?
Why do cross-lingual training and data augmentation help? To put this question in context, we first consider the relationship between the two languages. Finnish and North Sámi are mutually unintelligible, but they are typologically similar: of the 49 (mostly syntactic) linguistic features annotated for North Sámi in the World Atlas of Language Structures (WALS; Dryer and Haspelmath, 2013), Finnish shares the same values for 42 of them.⁶ Despite this and their phylogenetic and geographical relatedness, they share very little vocabulary: only 6.5% of North Sámi tokens appear in Finnish data, and these words are either proper nouns or closed-class words such as pronouns or conjunctions. However, both languages do share many character trigrams (72.5%, token-level), especially on suffixes.
Now we turn to an analysis of the $T_{10}$ data setting, where we see the largest gains for all methods.
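One plausible way to compute token-level overlap statistics of the kind quoted above is sketched here. The exact definitions used in the paper are not given in the text, so the choices below (counting each target token or trigram occurrence, and padding words with boundary markers) are assumptions.

```python
def char_trigrams(word):
    """Character trigrams of a word padded with boundary markers (an assumed convention)."""
    padded = f"<{word}>"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def token_overlap(target_tokens, source_tokens):
    """Fraction of target token occurrences whose form also occurs in the source data."""
    source_vocab = set(source_tokens)
    return sum(tok in source_vocab for tok in target_tokens) / len(target_tokens)

def trigram_overlap(target_tokens, source_tokens):
    """Fraction of target trigram occurrences that also occur in the source data."""
    source_trigrams = {g for tok in source_tokens for g in char_trigrams(tok)}
    target_trigrams = [g for tok in target_tokens for g in char_trigrams(tok)]
    return sum(g in source_trigrams for g in target_trigrams) / len(target_trigrams)

print(token_overlap(["abc", "abd"], ["abc"]))
print(trigram_overlap(["abc", "abd"], ["abc"]))
```

The toy strings show how trigram overlap can be much higher than word overlap when two vocabularies differ but share character sequences.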

Analysis of data augmentation
For dependency parsing, POS features are important because they can provide strong signals about whether a dependency exists between two words in a given sentence. For example, subject and object dependencies often occur between a NOUN and a VERB, as can be seen in Fig. 1a. We investigate the extent to which data augmentation is useful for learning POS features, using diagnostic classifiers (Veldhoen et al., 2016; Adi et al., 2016; Shi et al., 2016).
Table 4 reports the POS prediction accuracy. We observe that representations generated with monolingual MORPH seem to learn better POS, for most of the tags. On the other hand, representations generated with monolingual NONCE sometimes produce lower accuracy on some tags; only on nouns is the accuracy better than with monolingual MORPH. We hypothesize that this is because NONCE sometimes generates meaningless sentences which confuse the model. In parsing this effect is less apparent, mainly because monolingual NONCE has the poorest POS representation for infrequent tags (%dev), and better representation of nouns.

Effects of cross-lingual training
Next, we analyze the effect of cross-lingual training by comparing the monolingual baseline to the cross-lingual model with MORPH.
Cross-lingual representations. The fact that the cross-lingual model improves parsing performance is interesting, since Finnish and North Sámi have so little common vocabulary. What linguistic knowledge is transferred through cross-lingual training? We analyze whether words with the same POS category from the source and target treebanks have similar representations. To do this, we analyze the head predictions, and collect North Sámi tokens for which only the cross-lingual model correctly predicts the headword.⁷ For these words, we compare token-level representations of North Sámi development data to Finnish training data.
We ask the following questions: Given the representation of a North Sámi word, what is the Finnish word with the most similar representation? Do they share the same POS category? Information other than POS may very well be captured, but we expect that the representations will reflect similar POS since POS is highly relevant to parsing. We use cosine distance to measure similarity.
We look at four categories for which cross-lingual training substantially improves results on the development set: adjectives, nouns, pronouns, and verbs. We analyze representations generated by two layers of the model in §3.1: (1) the output of the character-level biLSTM (char-level), $e_c(w_i)$, and (2) the output of the word-level biLSTM (word-level), i.e., $h_i$ in Eq. 2.

⁷ Another possible way is to look at the label predictions. But since the monolingual baseline LAS is very low, we focus on the unlabeled attachment prediction since it is more accurate.
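The nearest-neighbour comparison described above can be sketched in a few lines. The sketch ranks by cosine similarity (equivalently, by smallest cosine distance); the `(word, pos, vector)` layout, the function names, and the toy vectors are assumptions for illustration, and vectors are assumed to be non-zero.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_words(query_vec, candidates, k=3):
    """Return the k (word, pos) candidates most similar to query_vec."""
    ranked = sorted(candidates, key=lambda c: cosine_similarity(query_vec, c[2]), reverse=True)
    return [(word, pos) for word, pos, _vec in ranked[:k]]

def same_pos_rate(queries, candidates):
    """Fraction of query tokens whose single nearest candidate has the same POS."""
    hits = sum(nearest_words(vec, candidates, k=1)[0][1] == pos for _word, pos, vec in queries)
    return hits / len(queries)

# Toy stand-ins for Finnish training tokens and North Sámi development tokens.
finnish = [("fi_1", "NOUN", [1.0, 0.0]), ("fi_2", "VERB", [0.0, 1.0]), ("fi_3", "ADJ", [1.0, 1.0])]
sami = [("sme_1", "NOUN", [0.9, 0.1]), ("sme_2", "VERB", [0.2, 0.8])]
print(nearest_words(sami[0][2], finnish, k=2))
print(same_pos_rate(sami, finnish))
```

Running the same procedure once on character-level vectors and once on word-level vectors gives the two conditions compared in this analysis.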

Table 5 shows examples of the top three closest Finnish training words for a given North Sámi word. We observe that the character-level representation focuses on orthographic similarity of suffixes, rather than POS. With the word-level representations, we find more cases in which the closest Finnish words have the same POS as the North Sámi word. In fact, when we compare the most similar Finnish word quantitatively (Table 6), we find that the word-level representations of North Sámi words are often similar to those of Finnish words with the same POS; the same trend does not hold for character-level representations. Since very few word tokens are shared, this suggests that improvements in cross-lingual training might simply be due to syntactic (i.e., word order) similarities between the two languages, captured in the dynamics of the biLSTM encoder, despite the fact that it knows very little about the North Sámi tokens themselves. The word-level representation has an advantage over the char-level representation in that it has access to contextual information like word order, and it has knowledge about the other words in the sentence.
Head and label prediction. Lastly, we analyze the parsing performance of the monolingual compared to the cross-lingual models. Looking at the produced parse trees, one striking difference is that the monolingual model sometimes predicts a "rootless" tree. That is, it fails to assign head index '0' to any word and label that dependency with a root label. In cases where the monolingual model predicts wrong parses and the cross-lingual model predicts the correct ones, we find that the "rootless" trees are predicted more than 50% of the time.⁸ Meanwhile, the cross-lingual model learns to assign a word with head index '0', although sometimes it is the incorrect word (e.g., it is the second word, but the parser predicts the fifth word). This pattern suggests that more training examples at least help the model to learn structural properties of a well-formed tree.
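A check for "rootless" predictions of the kind discussed above might look like the following sketch. It treats a tree as rootless when no token has both head index 0 and the label root; the `(head, label)` layout and function names are assumptions of this sketch.

```python
def is_rootless(tree):
    """True if no token is attached to index 0 with the label 'root'."""
    return not any(head == 0 and label == "root" for head, label in tree)

def rootless_rate(trees):
    """Fraction of predicted trees that are rootless."""
    return sum(is_rootless(tree) for tree in trees) / len(trees)

predicted = [
    [(2, "nsubj"), (0, "root"), (2, "obj")],   # well-formed root arc
    [(2, "nsubj"), (0, "nmod"), (2, "obj")],   # arc from index 0, but not labeled root
]
print(rootless_rate(predicted))
```

The second toy tree shows the case where the arc from the dummy root exists but the label has to be learned and is wrong.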
The ability of a parser to predict labels is contingent on its ability to predict heads, so we focus our analysis on two cases.How do monolingual and cross-lingual head prediction compare?And if both models predict the correct head, how do they compare on label prediction?
Figure 2 shows the difference between two confusion matrices: one for cross-lingual and one for monolingual models. The last column shows cases of incorrect heads and the other columns show label predictions when the heads are correct, i.e., each row summing to 100%. Here, blue cells highlight confusions that are more common for the cross-lingual model, while red cells highlight those more common for the monolingual model. For head prediction (last column), we observe that the monolingual model makes more errors, especially for nominals and modifier words. In cases when both models predict the correct heads, we observe that cross-lingual training gives further improvements in predicting most of the labels. In particular, regarding the "rootless" trees discussed before, we see evidence that cross-lingual training helps in predicting the correct root index, and the correct root label.

⁸ The parsing model enforces the constraint that every tree should have a head, i.e., an arc pointing from a dummy root to a node in the tree. It does not, however, enforce that this arc be labeled root; the model must learn the labeling.

Table 7: LAS results on development sets. zero-shot denotes results where we predict using a model trained only on the source treebank.
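A difference of row-normalized confusion matrices with an extra column for incorrect heads, as described for Figure 2, could be computed along the following lines. This is a sketch under assumptions: the `(head, label)` layout, the name of the extra column, and the dictionary representation are illustrative choices, not the procedure used to produce the figure.

```python
from collections import Counter, defaultdict

WRONG_HEAD = "<wrong head>"

def confusion_rows(gold, pred):
    """Row-normalized confusion (in percent): gold label -> predicted label or WRONG_HEAD."""
    counts = defaultdict(Counter)
    for (g_head, g_label), (p_head, p_label) in zip(gold, pred):
        column = p_label if p_head == g_head else WRONG_HEAD
        counts[g_label][column] += 1
    return {row: {col: 100.0 * n / sum(cols.values()) for col, n in cols.items()}
            for row, cols in counts.items()}

def confusion_difference(cross, mono):
    """Cell-wise cross-lingual minus monolingual percentages."""
    diff = {}
    for row in set(cross) | set(mono):
        cols = set(cross.get(row, {})) | set(mono.get(row, {}))
        diff[row] = {col: cross.get(row, {}).get(col, 0.0) - mono.get(row, {}).get(col, 0.0)
                     for col in cols}
    return diff

gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
mono = [(3, "nsubj"), (0, "root"), (2, "nsubj")]
cross = [(2, "nsubj"), (0, "root"), (2, "obj")]
print(confusion_difference(confusion_rows(gold, cross), confusion_rows(gold, mono)))
```

Positive cells correspond to outcomes that are more frequent for the cross-lingual model, and negative cells to outcomes more frequent for the monolingual model.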

Parsing truly low-resource languages
Now we turn to two truly low-resource treebanks: Galician and Kazakh. These treebanks are most analogous to the North Sámi $T_{10}$ setting and therefore we apply the best approach, cross-lingual training with MORPH augmentation. [...] a strong baseline, in a case when we have access to pre-trained word embeddings, for the source and/or the target languages.

In Table 8, best system is UDPipe-Future (Straka, 2018) for Galician and Uppsala (Smith et al., 2018) for Kazakh, and rank shows our best model's position in the shared task ranking for each treebank.
We treat a pre-trained word embedding as an external embedding, and concatenate it with the other representations, i.e., modifying Eq. 3 to $x_i = [e_w(w_i); e_p(w_i); e_c(w_i); l_k]$, where $e_p(w_i)$ represents a pre-trained word embedding of $w_i$, which we update during training. We use the pre-trained monolingual fastText embeddings (Bojanowski et al., 2017).⁹ We concatenate the source and target pre-trained word embeddings.¹⁰ For our experiments with transliteration (§2.4), we transliterate the entries of both the source and the target pre-trained word embeddings.

Experimental results
Table 7 reports the LAS performance on the development sets. MORPH augmentation improves performance over the zero-shot baseline and achieves LAS comparable to or better than that of a cross-lingual model trained with pre-trained word embeddings.
Next, we look at the effects of transliteration (see Kazakh vs Kazakh (translit.) in Table 7).In the zero-shot experiments, simply mapping both Turkish and Kazakh characters to the Latin alphabet improves accuracy from 12.5 to 21.2 LAS.Cross-lingual training with MORPH further improves performance to 36.7 LAS.

Comparison with CoNLL 2018
To see how our best approach (i.e., cross-lingual model with MORPH augmentation) compares with the current state-of-the-art models, we compare it to the recent results from the CoNLL 2018 shared task. Training state-of-the-art models may require lots of engineering and data resources. Our goal, however, is not to achieve the best performance, but rather to systematically investigate how far simple approaches can take us. We report performance of the following: (1) the shared task baseline model (UDPipe v1.2; Straka and Straková, 2017), (2) the best system for each treebank, (3) our best approach, and (4) a cross-lingual model with fastText embeddings.
Table 8 presents the overall comparison on the test sets. For each treebank, we apply the same sentence segmentation and tokenization used by each best system.¹¹ We see that our approach outperforms the baseline models on both languages. For Kazakh, our model (with transliteration) achieves a competitive LAS (28.23), which would be the second position in the shared task ranking. For comparison, the best system for Kazakh (Smith et al., 2018) trained a multi-treebank model with four source treebanks, while we only use one source treebank. Their system uses predicted POS as input, while ours depends solely on words and characters. The use of more treebanks and predicted POS is beyond the scope of our paper, but it is interesting that our approach can achieve the second-best result with such minimal resources. For Galician, our best approach outperforms the baseline by 8.09 LAS points. Note that the Galician treebank does not come with development data. We use a 50:50 train/dev split, while other teams might use a larger share for training (for example, the best system (Straka, 2018) uses a 90:10 train/dev split). Since we treat Galician as our test data, we did not tune the proportion of training data, but we suspect that this is the main reason why our system achieves rank 10 out of 27.
Compared to cross-lingual models with fastText embeddings (fastText vs. MORPH), we observe that our approach achieves better or comparable performance, showing its potential when there is not enough monolingual data available for training word embeddings.

Conclusions
In this paper, we investigated various low-resource parsing scenarios. We demonstrated that in the extremely low-resource setting, data augmentation improves parsing performance both in monolingual and cross-lingual settings. We also showed that transfer learning is possible with lexicalized parsers. In addition, we showed that transfer learning between two languages with different writing systems is possible, and future work should consider transliteration for other language pairs. While we have not exhausted all the possible techniques (e.g., use of external resources (Rasooli and Collins, 2017; Rosa and Mareček, 2018), predicted POS (Ammar et al., 2016), multiple source treebanks (Lim et al., 2018; Stymne et al., 2018), among others), we showed that simple methods which leverage the linguistic annotations in the treebank can improve low-resource parsing. Future work might explore different augmentation methods, such as the use of synthetic source treebanks (Wang and Eisner, 2018) or contextualized language models (Peters et al., 2018; Howard and Ruder, 2018; Devlin et al., 2018) for scoring the augmented data (e.g., using perplexity).
Finally, while the techniques presented in this paper might be applicable to other low-resource languages, we also want to highlight the importance of understanding the characteristics of the languages being studied. For example, we showed that although North Sámi and Finnish do not share vocabulary, cross-lingual training is still helpful because they share similar syntactic structures. Different language pairs might benefit from other types of similarity (e.g., morphological), and investigating this would be another interesting direction for future work on low-resource dependency parsing.

A Effects of Fine-Tuning for Cross-Lingual Training
For our cross-lingual experiments in Section 2.3, we observe that fine-tuning on the target treebank always improves parsing performance.

B Cyrillic to Latin Alphabet mapping
We use the following character mapping for Cyrillic to Latin Kazakh treebank transliteration.

Figure 1: Examples of dependency tree morphing operations on the sentence "She wrote me a letter".

Figure 2: Differences between cross-lingual and monolingual confusion matrices. The last column represents cases of incorrect heads and the other columns represent cases for correct heads, i.e., each row summing to 100%. Blue cells show higher cross-lingual values and red cells show higher monolingual values.

Table 1: Train/dev split used for each treebank.

Table 2: Number of North Sámi training sentences.

Table 3: LAS results on North Sámi development data. mono-base and cross-base are models without data augmentation. % improvements over mono-base shown in parentheses.

Table 4: Results for the monolingual POS predictions, ordered by the frequency of each tag in the dev split (%dev). %diff shows the difference between each augmentation method and monolingual models.

⁶ There are 192 linguistic features in WALS, but only 49 are defined for North Sámi. These features are mostly syntactic, annotated within different areas such as morphology, phonology, nominal and verbal categories, and word order.

We use diagnostic classifiers to probe our model representations. Our central question is: do the models learn useful representations of POS, despite having no direct access to it? And if so, is this helped by data augmentation? After training each model, we freeze the parameters and generate context-dependent representations (i.e., the output of the word-level biLSTM, $h_i$ in Eq. 2) [...]

Table 5: Most similar Finnish words for each North Sámi word based on cosine similarity.

Table 6: Number of North Sámi tokens for which the most similar Finnish word has the same POS.

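Table 8: Comparison to CoNLL 2018 UD Shared Task on test sets. best system is the state-of-the-art model for each treebank: UDPipe-Future (Straka, 2018) for Galician and Uppsala (Smith et al., 2018) for Kazakh. rank shows our best model's position in the shared task ranking for each treebank.

Table 9 reports LAS for cross-lingual models with and without fine-tuning.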

Table 9: Effects of fine-tuning on North Sámi development data, measured in LAS. mono-base and cross-base are models without data augmentation. % improvements over mono-base shown in parentheses.