Data Augmentation via Subtree Swapping for Dependency Parsing of Low-Resource Languages

The lack of annotated data is a major obstacle to building reliable NLP systems for most of the world's languages, but it can be alleviated by automatic data generation. In this paper, we present a new data augmentation method for artificially creating new dependency-annotated sentences. The main idea is to swap subtrees between annotated sentences while enforcing strong constraints on those subtrees to ensure maximal grammaticality of the new sentences. We also propose a method for performing low-resource experiments with resource-rich languages: we mimic low-resource languages by sampling sentences under a low-resource length distribution. In a series of experiments, we show that our newly proposed data augmentation method outperforms previous proposals using the same basic inputs.


Introduction
Data sparsity has been a problem since the beginning of natural language processing. Neural networks have not solved it and, with their hunger for data, have made it even more visible. Hence the revived interest in data augmentation: artificially created annotation can prove very useful to tackle the lack of manually annotated training data.
Many approaches have been proposed to perform data augmentation. Some rely on external resources, such as unannotated raw text, in order to iteratively increase their training data with automatically annotated examples (McClosky et al., 2006; Yu and Bohnet, 2015). However, this method is error-prone, and useful annotations must be separated from harmful ones. Other proposals instead rely solely on the available annotated data to generate new examples (Şahin and Steedman, 2018).
In this paper, we present a language-agnostic approach to the automatic generation of dependency-annotated sentences. Our method swaps compatible subtrees between different sentences in order to generate new annotated sentences. By enforcing a number of constraints on the subtrees to be swapped, we avoid generating overly ungrammatical sentences. Furthermore, contrary to previous work that kept the syntactic structure of sentences or impoverished it, our method injects structures from other sentences and can therefore introduce more syntactic complexity into the generated sentences.
In order to assess the potential of our new data augmentation method for low-resource languages in a way that is independent from the specific sentences in low-resource treebanks, we propose a method to mimic low-resource language data using high-resource language data. By sampling sentences from high-resource languages using a low-resource language distribution (over sentence length), we can perform the same experiment several times in a more faithful low-resource setting, while dampening the role played by each individual sentence on the final results, which can be prominent in very low-resource settings.
While there are many approaches to parsing low-resource languages, such as unsupervised parsing or cross-lingual transfer, in this paper we consider a very restricted setting where one only has access to mono-lingual parsing data and nothing more. This very strict setting addresses two research problems.
First, how can we analyse a language that is both an isolate and uses its own writing system (e.g. Japanese, Korean), at least in the data, when no typological information is available? Second, how much can we learn from a very limited amount of training data?

Related Work
Automatically generated annotation has been used for dependency parsing for at least two decades. McClosky et al. (2006) used a combination of a k-best parser and a discriminative reranker in order to increase their training set size with automatically parsed sentences. Later, McDonald et al. (2011) and Wróblewska and Przepiórkowski (2012) proposed various projection techniques to create annotated data for languages that did not have any, relying to different extents on parallel corpora.
Sentence morphing itself is not new either. Wang and Eisner (2018) and Rasooli and Collins (2019) proposed to reorder sentences from a source language in order to better match the word order of a target language. Later work proposed not only to reorder words but also to delete some, such as determiners, when they are irrelevant for the target language.
However, these methods rely on external resources (parse trees in other languages and/or unparsed sentences from the same language) to create new data, and creating new examples this way introduces noise into the training data. To avoid this kind of problem, one can instead directly try to expand the little annotated data available.
Şahin and Steedman (2018) proposed to transform gold dependency trees directly, by rotating their core arguments and deleting subtrees, in order to generate new sentences for training POS taggers for low-resource languages. While similar in spirit to their work, which justifies an experimental comparison, our work differs in two important respects. First, their data generation methods and ours differ both in the grammaticality constraints and in the shape of the generated sentences. Second, while their data generation process morphs dependency trees, they evaluated it on POS tagging. It was thus unclear how their data augmentation method would behave on dependency parsing, where the data generation process and the success measure both target the same structure.
Later, Vania et al. (2019) evaluated the data augmentation technique of Şahin and Steedman (2018) directly on dependency parsing, as part of a wider investigation of methods for parsing low-resource languages. Their results showed that creating sentences by morphing trees indeed helps parsers. They also considered replacing words with words of agreeing morphology to create new "nonce" sentences, after Gulordava et al. (2018), and looked at cross-lingual training, which is also beneficial.
In this paper, we depart from their work by focusing on gold data morphing only. We propose a new data augmentation method that creates new "gold" parses from existing ones by swapping compatible subtrees. We also propose an experimental method for mimicking a low-resource setting using high-resource languages.

Data Augmentation
In this section we present a language-agnostic method to automatically generate high-quality dependency-annotated data from a small amount of (manually) annotated sentences. Besides being as language-agnostic as possible, our method should (1) create new structures and (2) create grammatically sound sentences.
While the cropping operation of Şahin and Steedman (2018) creates new sentences, it has two main issues. The first is that sentences created this way may be ill-formed. For example, French finite verbs always need an overt subject, and removing it creates ungrammatical sentences. The second is that it creates new sentences by impoverishing existing ones. Cropped trees are by definition smaller and simpler than their source trees. This may be a problem since small treebanks already have less material and simpler structures than bigger ones.
To avoid these issues, we propose to create new sentences by swapping subtrees between sentences, under constraints we describe below. Swapping subtrees between sentences creates both simpler and more complex sentences, and strong constraints ensure that these sentences are as grammatical as possible.

Figure 1: Illustration of the tree swapping operation between sentences. Sentences c and d result from swapping the direct objects (represented with an incoming dashed arrow) of sentences a and b. Sentence c is more complex than a since its new direct object has an adjective.

Figure 1 shows an example of tree swapping that creates both a more complex and a simpler sentence. In sentence c, the new direct object has an adjective modifying the noun, while in a, the noun had just a determiner. Swapping subtrees can introduce other new complexities into otherwise simple sentences, such as relative clauses. Sentences d and b show the inverse phenomenon.
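Since only projective subtrees are swapped (each subtree then covers a contiguous token span, see the structural constraints below), the swap operation itself reduces to a span exchange. The following sketch is illustrative only, not the authors' implementation; the token-list representation and the hand-picked spans are assumptions:

```python
def swap_subtrees(sent_a, span_a, sent_b, span_b):
    """Exchange two token spans (start, end), end exclusive, each
    assumed to cover a whole projective subtree."""
    i, j = span_a
    k, l = span_b
    new_a = sent_a[:i] + sent_b[k:l] + sent_a[j:]
    new_b = sent_b[:k] + sent_a[i:j] + sent_b[l:]
    return new_a, new_b

# Toy example mirroring Figure 1: swapping the direct objects.
a = ["She", "reads", "the", "book"]
b = ["He", "buys", "a", "nice", "car"]
new_a, new_b = swap_subtrees(a, (2, 4), b, (2, 5))
# new_a = ["She", "reads", "a", "nice", "car"]: more complex than a,
# since its new object carries an adjective; new_b is simpler than b.
```

In practice the head indices and dependency labels of the moved tokens would of course also have to be re-based onto the host sentence.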

Grammatical Constraints
In the following, we assume Universal Dependencies (UD) annotations (Zeman et al., 2019), which we also use for the experiments, but the idea behind our data augmentation technique can be applied to other types of tree annotations, provided the relevant information is easily accessible at the head of a subtree.
To be as language-agnostic as possible while aiming at maximal grammaticality, we require the roots of the subtrees being swapped to have the same POS tag, morphological features and dependency relation. Of course, enforcing all three constraints at once is far too rigid for most languages, but we think it is a good compromise that provides a reasonable solution for every language, since knowing which constraint can be relaxed is highly language-specific.
For example, as English and French do not mark case on their noun phrases, a subject noun phrase could be replaced by an object one as long as they agree in number (and sometimes in gender). In Japanese, however, case is marked with clitics, which in UD attach to the nouns they modify, while the nouns themselves do not carry any case marking. Therefore, swapping noun phrases whose roots agree in POS tag and morphological features, but not in dependency relation, would often lead to ungrammatical sentences. Relaxing the morphological feature constraint would cause similar trouble in languages like Hebrew or Amharic, in which verbs inflect for the gender and number of the subject. Finally, relaxing the POS tag constraint could cause trouble in Turkic languages, where nominal direct objects can appear in both the nominative and the accusative case, while pronominal direct objects can only appear in the accusative.
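The compatibility test implied by these three constraints can be sketched as follows; the dict-based token representation with CoNLL-U-style field names is our assumption for illustration, not a description of the actual implementation:

```python
def compatible(root_a, root_b):
    """Two subtrees may be swapped only if their roots agree in POS
    tag, morphological features and dependency relation."""
    return (root_a["upos"] == root_b["upos"]
            and root_a["feats"] == root_b["feats"]
            and root_a["deprel"] == root_b["deprel"])

# Same POS and features, but different relation: not swappable under
# the strict constraints (cf. the Japanese case-clitic example).
subj = {"upos": "NOUN", "feats": "Number=Sing", "deprel": "nsubj"}
obj  = {"upos": "NOUN", "feats": "Number=Sing", "deprel": "obj"}
```

Relaxing a constraint, as explored later in the experiments, simply amounts to dropping the corresponding conjunct.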

Structural Constraints
In this paper, we only consider a subset of the available POS tags and dependency relations in order to guarantee the quality of the generated sentences. Regarding POS tags, we only consider NOUN, PROPN, ADJ and VERB. We avoid the other parts-of-speech because they could lead to more ungrammaticality. For example, replacing a form of the auxiliary (AUX) to be in a continuous construction with the corresponding form of to have would result in an ungrammatical sentence (I am eating. > *I have eating.). Similarly, adverbs have such a wide range of uses that replacing any adverb by any other would likely yield incorrect sentences.
Regarding relations, we consider all core arguments (nominal and clausal), all non-core dependents except discourse, expl and dislocated, and all nominal dependents. We ignore all the remaining relations, including conjunctions, multi-word expressions and so on, as defined in Universal Dependencies. This is detailed in the Appendix.
In order to make tree swapping easier, we also only consider projective subtrees and only allow one swap at a time. While we could generate more sentences by allowing multiple swaps, a single swap already leads to a great number of generated sentences, so we only consider this case in this paper. Moreover, we do not swap trees between a sentence and itself, to avoid intra-sentence redundancy. And we never swap main-clause predicates (this is taken care of by excluding the root relation), because doing so would not create new sentences but rather duplicate them.
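The projectivity condition can be checked by verifying that a subtree's yield forms a contiguous span. A minimal sketch, assuming a simple head-index array (with -1 for the sentence root) rather than any particular treebank library:

```python
def subtree_yield(heads, root):
    """Indices of `root` and all of its descendants; heads[i] is the
    index of token i's head, -1 for the sentence root."""
    nodes = {root}
    changed = True
    while changed:
        changed = False
        for i, h in enumerate(heads):
            if h in nodes and i not in nodes:
                nodes.add(i)
                changed = True
    return nodes

def is_projective_subtree(heads, root):
    """A subtree qualifies for swapping only if its yield is contiguous."""
    nodes = subtree_yield(heads, root)
    return max(nodes) - min(nodes) + 1 == len(nodes)

# "She reads the book": "reads" is the root, "the" attaches to "book".
heads = [1, -1, 3, 1]
# The subtree of "book" (index 3) has yield {2, 3}: projective.
```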

Low-resource Experiments with High-Resource Languages
To be able to perform extensive experiments, we chose to work with resource-rich languages and use them to mimic low-resource conditions. The most basic way to do so is to sample a small number of sentences from a big treebank to artificially create a small one. This has the big advantage that we can sample many different small treebanks and therefore dampen the influence of any given training sentence on parsing results. This is especially important in a very low-resource setting where each sentence can have a strong impact on the actual parsing results, while not being very representative of its actual source language.
However, the difference between high-resource treebanks and low-resource treebanks goes beyond the mere number of sentences or words. Bigger treebanks tend to have more varied constructions than their smaller counterparts. This means that for a set number of sentences or tokens, a sample from a resource-rich treebank may contain more information than a sample from a resource-poor one. Therefore, if we want to mimic low-resource languages with high-resource ones, we have to take sentence complexity into consideration.
In this paper we use sentence length as a surrogate for sentence complexity, under the assumption that more complex constructions require more words. Other complexity markers could be taken into account, such as the dependency relation distribution. But annotators do not choose sentences according to their internal structure or complexity, so the relation distribution of a small treebank is as contingent as the presence of any given sentence in it. We leave this for future work.

Low-Resource Sampling
In order to mimic low-resource languages with high-resource ones, we sample data from high-resource languages using the macro-averaged probability distribution of sentence length for languages that have fewer than 1000 training sentences overall. We truncate the distribution to sentences shorter than 100 tokens.

Figure 2 shows the sentence length distribution of four sets of languages from UD 2.4. The dashed red line represents training sentences of languages with more than 1000 training sentences (over all training sets when there are multiple treebanks). The dash-dotted teal line represents training sentences of languages with fewer than 1000 training sentences (low-resource languages). The plain blue line is a smoothed version of the same distribution, used to sample our training sets. The dotted orange line is the distribution of test sentence length for languages that do not have a training set at all. While having a higher mode, the faster decay of the low-resource distribution gives it a lower mean than the resource-rich one. The test distribution of under-resourced languages (without a training set) is skewed toward short sentences, with both a low mean and a fast decay.

Figure 2: Macro-averaged distribution of training sentence length in UD 2.4 for all languages, resource-rich languages (more than 1000 training sentences) and low-resource languages (fewer than 1000 training sentences), together with the smoothed version used for sampling and the test sentences of languages without a training set.

Table 1 reports the average sentence length for the treebanks used in the experiments, as well as for all resource-rich languages from UD 2.4 (more than 1000 training sentences), less-resourced languages (fewer than 1000 training sentences) and languages without training sentences. We see that average sentence length covers a wide range of values, from short sentences in Finnish to sentences three times longer in Hebrew.
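The sampling procedure described above might be implemented along the following lines; the helper name, the made-up length probabilities and the sampling with replacement are our illustrative assumptions, not the paper's exact procedure:

```python
import random

def sample_by_length(sentences, length_probs, n, max_len=100, seed=0):
    """Draw n sentences so that their lengths follow `length_probs`,
    a (smoothed) length distribution truncated at `max_len` tokens."""
    rng = random.Random(seed)
    by_len = {}
    for s in sentences:
        if len(s) < max_len:
            by_len.setdefault(len(s), []).append(s)
    lengths = [l for l in length_probs if l in by_len]
    weights = [length_probs[l] for l in lengths]
    sample = []
    while len(sample) < n:
        length = rng.choices(lengths, weights=weights)[0]
        sample.append(rng.choice(by_len[length]))
    return sample

# Toy corpus: as many long as short sentences, but the (invented)
# low-resource distribution heavily favours the short ones.
corpus = [["w"] * 3] * 50 + [["w"] * 30] * 50
small_treebank = sample_by_length(corpus, {3: 0.8, 30: 0.2}, n=40)
```

A real implementation would presumably sample without replacement within each length bucket, but the principle is the same.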
We also see that sentence length does not correlate with morphological complexity: on average, Basque sentences are only one word shorter than Vietnamese ones, while Basque is morphologically much richer than Vietnamese. But the last three lines show that as the quantity of resources decreases, so does the average sentence length. This is especially true for languages with no training data, whose test sentences are 5 words (28.5%) shorter than those of resource-rich languages. This also means that the test sets of better-resourced languages must be more complex and more representative of their actual language. Thus, testing models trained on sub-sampled high-resource languages against their full test sets should give a faithful lower bound on the scores achievable for truly low-resource languages.

Experiments
In order to assess the potential of tree swapping as a data augmentation method for low-resource languages, we ran a series of experiments on 10 languages from UD 2.4, representing various families and various syntactic and morphological typologies. For each language, we sampled eight sets of 40 sentences with the smoothed distribution of low-resource sentence lengths depicted in Figure 2. While 40 training sentences might seem very low, in UD 2.4, Buryat has 19 training sentences, Kurmanji 20, Upper Sorbian 23 and Kazakh 31.

Quantity and Quality of Generated Sentences
We first start by looking at the number of new sentences generated by our method and compare it with the number of sentences generated by sentence cropping and rotation. We then look at the quality of the sentences generated by our method.

Number of Generated Sentences
While the constraints presented in Section 3 are very strict, to prevent most ungrammaticalities while staying language-agnostic, they still lead to a good number of new sentences. By definition, Şahin and Steedman (2018)'s crop and rotate operations produce linearly many new sentences with respect to the number of original sentences, but one swap creates quadratically many new sentences. If applied several times, tree swapping could generate arbitrarily many and arbitrarily big sentences, on the order of O(n^(k+1)), where n is the number of original sentences and k the number of swaps. Table 2 reports the average number of sentences generated by Şahin and Steedman's crop and rotate operations and by our own tree swapping operation on datasets containing 40 sentences each.

Table 2: Average number of sentences generated by each operation. For each language we sample 8 sets of 40 sentences, each following the same sentence length distribution, and use them as the source to generate new sentences.
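To make the quadratic growth concrete, the number of candidate single swaps can be counted over cross-sentence pairs of subtree signatures. This toy count is an illustration under our simplified signature representation, not the paper's actual bookkeeping:

```python
def candidate_swaps(subtree_sigs):
    """subtree_sigs: one list of (POS, feats, deprel) signatures per
    sentence. Count cross-sentence signature matches, i.e. possible
    single swaps between distinct sentences."""
    count = 0
    for i in range(len(subtree_sigs)):
        for j in range(i + 1, len(subtree_sigs)):
            for sig_a in subtree_sigs[i]:
                for sig_b in subtree_sigs[j]:
                    if sig_a == sig_b:
                        count += 1
    return count

# 40 sentences, each with one matching object noun phrase: every one
# of the 40 * 39 / 2 = 780 sentence pairs yields a swap, so the number
# of new sentences grows quadratically with the number of originals.
sigs = [[("NOUN", "Number=Sing", "obj")]] * 40
```

Each swap produces two new sentences, and applying swaps to already-generated sentences gives the O(n^(k+1)) growth mentioned above.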
As expected, we see that the number of new sentences generated by tree swapping is much higher than the number generated by crop and rotate. We also see a correlation between morphological richness and the number of generated sentences. Morphologically rich languages (Basque, Finnish, Hebrew and Russian) have a rather low number of new sentences, fewer than one per pair (n² = 1600), meaning that not all sentence pairs have compatible subtrees because of the rigid constraints. On the other hand, morphologically simpler languages (such as Vietnamese) generate more sentences, almost 3 new sentences per pair, as there are fewer morphological constraints on subtrees.
Tamil is an outlier here, as it has the highest number of generated sentences despite being morphologically rich. Upon closer investigation, it turns out that UD Tamil data is relatively less complex than that of other morphologically rich languages. For example, in Russian sentence sets, there are 78 (POS, morphological features, dependency relation) triplets (with NOUN as POS) for 177 nouns (2.27 nouns per triplet), while in Tamil there are only 38 such triplets for 187 nouns (4.92 nouns per triplet). And Tamil data are much more skewed than those of other morphologically rich languages in terms of morphological feature distributions. The most frequent triplet where the POS is noun appears less than 6% of the time in Russian, but more than 23% of the time in Tamil.

Grammaticality of Generated Sentences
As mentioned earlier, the strong constraints imposed on the trees to be swapped aim to ensure that most generated sentences are grammatical. However, since it is hard to guarantee that every new sentence is actually valid, we manually inspected English and French sentences generated by the swap mechanism.
Out of a set of 100 sentences generated from the French Sequoia treebank, 14 could be considered problematic. Nine had an nmod noun phrase replaced by another one either lacking a determiner or inserting one in a context that does not allow it. Of these nine sentences, three were actually due to dates in which the month was replaced by another noun phrase. Similarly, one sentence had a subordinate clause whose infinitive verb was missing an adposition. Three sentences were odd but not strictly ungrammatical, due to adjectives that usually occur after their noun being placed before it. Finally, one sentence contained a quote; while the modification merely sounded odd inside the quote, outside of one it could be considered ungrammatical.
Out of 100 sentences generated from the English EWT treebank, four were problematic. Two had problems with determiners after an adjective swap: in one case, the new adjective started with a consonant while the previous one started with a vowel, making the preceding "an" incorrect; in the other, an adjective in a bare noun phrase was replaced by "same", which thus lacked a definite article. The other two involved nested clauses. In one, a relative clause in which the governing noun filled the object role was replaced by a relative clause with all its core arguments filled. In the other, a simple direct object was swapped with a direct object carrying a relative clause, which in context was very odd.
Further exploration of the sentences generated for the experiments revealed that adpositions are also a source of errors, for example when swapping bare infinitives with to-infinitives. As this was already observed in French, we assume the same holds for other languages. Examples of generated sentences are given in Table 3; problematic sentences are marked with an asterisk.

Table 3: Examples of generated sentences. The newly inserted material is underlined. The second sentence is not strictly ungrammatical, but shows an example of adposition mismatch.
* He says , I have to have an Corporate ADDRESS .
* The forest phlox is blooming to Australia instead of mid-May .
Please send me an excel spreadsheet which depicts the Shrimp Scampi Dinner .
Do n't go , or you will learn how to waste any form of modern medicine .
A reviewer raised the concern that adpositions may be more important to grammaticality in other languages and that we should also require adpositions or other markers of the root to agree between the subtrees to be swapped. We agree with this remark and note that some UD treebanks are starting to be annotated with so-called enhanced dependencies which, amongst other improvements, enhance relation labels with the lemma of the adposition that selects the phrase. Enhanced relations also encode information about the role of the governor in relative clauses. While the number of enhanced treebanks is still low, using enhanced relations instead of vanilla ones would further increase the grammaticality of the created sentences.
This being said, the errors made by our method are actually interesting in that they question the notion of grammaticality. While a French or an English speaker would not produce these sentences, if one were asked to analyse them within the UD framework, their analysis would likely be similar to the ones produced by our method.
We should also note that while 14 problematic sentences out of a hundred might seem like a lot, it in fact represents 14 questionable arcs out of several hundred. This opens a question that we will need to consider in future work: assuming that the test sentences themselves are grammatical, to what extent can the training data be ungrammatical?

Parsing Results
In a first experiment, we compare tree swapping with previous data augmentation techniques: tree cropping and rotation (Şahin and Steedman, 2018) and nonce sentences (Gulordava et al., 2018). As the number of new sentences that can be generated through tree swapping grows as O(n^(k+1)), where n is the number of original sentences and k the number of swaps per sentence, we also look at the impact of the number of generated sentences. As the constraints for tree swapping are strong and therefore limit the number of sentences that can be generated, in a second experiment we look at the impact of relaxing those constraints. Finally, as the number of generated sentences becomes big enough, it is possible to generate a separate development set for validating the trained models, so we also look at the impact of using an artificial development set.
We run each experiment 8 times (once for each sample). The reported results are LAS scores on the original dev sets, averaged over those 8 runs. For all the experiments, we use an implementation of the biaffine parser of Dozat et al. (2017) available on Github. We do not use pretrained word embeddings, so word and character representations are learned from scratch alongside the rest of the model. In this reduced data setting, we use 100 dimensions for word embeddings and 50 for characters. We only consider the task of parsing and use gold segmentation and UPOS tags.

Crop, Rotate, Nonce and Swap

Table 4 reports the average parsing results obtained on the development set using the base sampled training sets and sets augmented via cropping, rotating, nonce sentences and swapping. For this experiment, we use the maximum number of sentences generated by cropping and rotating. For nonce sentences and tree swapping, as it is easy to generate more sentences, we create n new sentences per original sentence. We create 4 nonce sentences per original sentence, for a total of 200 training sentences. For tree swapping, we experiment with n ∈ {2, 3, 4, 9}, effectively giving training sets of size 120, 160, 200 and 400.

Table 4: Average parsing results on the development set. Crop and Rotate use the operations of Şahin and Steedman (2018). Nonce uses sentences generated by replacing single words with compatible ones from other sentences, as in Gulordava et al. (2018). Swap is our new subtree swapping method. Beside each method name is the average increase in the number of sentences compared to the baseline.

While this is more sentences than for crop and rotate, the original input is always the same 40 sentences; the difference comes from the augmentation techniques and not from the data, so the comparison remains fair. We see that on average, all augmentation techniques improve the results over the low-resource baseline. Swapping is consistently the best option, irrespective of typological differences. In more detail, we observe a strict ordering of crop, rotate and swap. Crop, which generates sentences with reduced complexity, has the lowest score of the three. Rotate, which keeps most of the original complexity but reorders it, is better. Swap, which potentially introduces new complexity, has the best score. As nonce is a weaker version of swap that only changes words in place and does not introduce new structures, it has a lower score than swap. This strongly suggests that a good data augmentation technique needs to create syntactic complexity.
Furthermore, while swap already beats rotate when they are allowed a comparable number of new sentences (2.98 vs. 3), swap can generate many more new sentences than rotate, and as we see, on average the score increases with the number of new sentences. This is interesting: not only does the new method generate a large number of sentences, but the models actually benefit from them.

Table 5 reports parsing results obtained on the development set with training sets augmented via tree swapping under all possible constraint relaxations. Note that while the POS tag and morphological features are tied to the head of a subtree, the dependency relation depends on the head of the subtree, but also on its new head and their relative position. We therefore choose to assign the relation of the original subtree's head to the newly swapped subtree's head. This might not be the best choice in all circumstances and is certainly very language-specific. We notice several interesting things. First, data augmentation is substantially beneficial for all but one language (Finnish) under all constraints. Then, over all the augmented settings, the one without constraints (last column) has the lowest score. This shows that constraints are important, presumably because they help preserve grammaticality. And, as expected, not all constraints fit all languages equally well.

Table 5: Average parsing results for different data augmentation constraints. P means that the heads of swapped trees must agree in POS tags, M that they must agree in morphological features, and R that they must agree in dependency relation. None means that no such constraints apply. The last row is the best of each column, assuming selection on a validation set. Each original sentence receives 4 new ones.

While some language/constraint scores might be surprising, it is important to note that we only partially relax those constraints. As we focus on a rather small subset of POS tags and on core dependency relations, relaxing some constraints might not have as strong an effect as if we had allowed the whole range of tree swapping in the first place. Even when relaxing the POS constraint, we only allow verbs, nouns, proper nouns and adjectives to move. Furthermore, in some languages there might be a strong overlap between relations and POS tags, or between relations and morphological features, further dampening the effect of relaxing a constraint. For example, in a language where nouns inflect for case but verbs do not, morphological features alone will most of the time be sufficient to ensure that we do not swap verbs with nouns, so the POS constraint is superfluous. In fact, the best constraint on average is agreeing POS tags alone, but it does not work for all languages, Tamil and Finnish being counterexamples. Yet from the last row, it seems that the best option is still to pick the best constraint set for each language independently, via validation on held-out data.
Looking more closely at individual languages, general patterns are harder to find. Quite surprisingly, we see that morphological features alone are not a good constraint. For Vietnamese, a morphologically very simple language, morphological features do not add anything; in fact, its PM and P settings are almost identical and give the same score. For Russian, cases can be triggered by different prepositions, meaning that cases need not align with dependency relations, and therefore relying solely on case for swapping trees may lead to erroneous trees. This is all the more likely given that we reassign subtrees' head dependency relations, meaning that we can end up with a prepositional phrase labeled as a direct object based on the morphology of its head alone. We suspect similar reasons are behind the specific patterns of each language.
Another point to note is that the biaffine parser used for our experiments does not encode morphological information directly, but uses a character based word representation to indirectly represent morphological features. However, it has a dedicated POS tag embedding and uses dependency relation directly in its learning objective. That gives a different role to each constraint, with the morphological feature constraint being more remote from the parsing algorithm than the other two. We assume different parsing algorithms relying on different input representations would work differently with those constraints.

Importance of Development Data
When the number of sentences available is very low, it might not be possible to further split them into a training and a development set. This means that methods that need validation are more prone to overfitting, as they will use the same data for training and validation. However, our data augmentation method creates enough new data that it becomes possible to have a dedicated validation set distinct from the training set. But as this validation set is generated from the same original data as the training set, we need to see whether it fulfills its role. Table 6 reports the results of parsers trained using different validation sets. The two middle rows (Train and Dev) show results using data augmented from the same original sentences. Train uses the same augmented set of 200 sentences for both training and validation (as does Base, but Base only has access to the 40 original sentences), while Dev uses a distinct development set created for validation purposes. For the last row, the models were validated with the validation set created from another sentence sample.
Whether training benefits from a distinct validation set is language-dependent and on average not significant. However, as expected, using a validation set based on other sentences is beneficial. This is interesting since data augmentation is otherwise useful. The newly created sentences seem diverse enough to improve the quality of the learnt models, but not enough to make a real difference when it comes to validation. A likely explanation is that while the parser sees new structures in the augmented data, the basic underlying information, especially the vocabulary, remains the same. Thus models validated on a created development set (different structure, but same vocabulary) may be pushed to rely more on vocabulary than on the actual dependency structure. On the other hand, models validated on a different set of sentences (different structure and different vocabulary) are pushed to pay more attention to structure since they need to handle unseen words.

Table 6: Average parsing results using various data sets as validation sets. Train uses the augmented training set for both training and validation of the model. Dev uses a distinct development set created from the same sentences as the training set for validation. Other uses a development set created from another set of 40 sentences for validation.

Conclusion
In this paper, we have presented a new language-agnostic data augmentation technique based on dependency subtree swapping for creating new dependency trees. We also presented a method for performing more faithful low-resource experiments using high-resource languages by sampling training data under a distribution that favors shorter sentences to mimic the sentence length distribution of low-resource languages.
We have shown that our newly proposed tree swapping method consistently outperforms previously proposed augmentation techniques based on tree morphing. We have also shown that our method can create many new sentences and that they are useful for parser training, as the score increases with the number of sentences. This is important since, contrary to previous tree morphing techniques, the number of sentences created by tree swapping is potentially unbounded. We have then shown that relaxing the strong swapping constraints on a per-language basis further improves the results, though the language/constraint relation is not necessarily clear. Finally, we saw that despite being useful for training parsers, the created sentences are not diverse enough to be useful for model validation.
Previous work has demonstrated the possibility of training parsers with incomplete annotation (Lacroix et al., 2016). As a few generated sentences may sound odd or be slightly ungrammatical, it would be interesting to see how parsers fare when trained with sound trees over ungrammatical sentences. We leave this for future work. We also need to further investigate the three-way interaction between languages, augmentation techniques and parsing algorithms, as apparently not all augmentation techniques fare equally well for all languages. Mixing data augmentation policies might also have a positive impact. More generally, it would be interesting to see how far a parser can go with only a handful of annotated sentences.