Data Augmentation via Dependency Tree Morphing for Low-Resource Languages

Neural NLP systems achieve high scores in the presence of sizable training dataset. Lack of such datasets leads to poor system performances in the case low-resource languages. We present two simple text augmentation techniques using dependency trees, inspired from image processing. We “crop” sentences by removing dependency links, and we “rotate” sentences by moving the tree fragments around the root. We apply these techniques to augment the training sets of low-resource languages in Universal Dependencies project. We implement a character-level sequence tagging model and evaluate the augmented datasets on part-of-speech tagging task. We show that crop and rotate provides improvements over the models trained with non-augmented data for majority of the languages, especially for languages with rich case marking systems.


Introduction
Most recently, various deep learning methods have been proposed for many natural language understanding tasks including sentiment analysis, question answering, dependency parsing and semantic role labeling.Although these methods have reported state-of-the-art results for languages with rich resources, no significant improvement has been announced for low-resource languages.In other words, feature-engineered statistical models still perform better than these neural models for low-resource languages. 1Generally accepted reason for low scores is the size of the training data, i.e., training labels being too sparse to extract meaningful statistics.
Label-preserving data augmentation techniques are known to help methods generalize better by increasing the variance of the training data.It has been a common practice among researchers in computer vision field to apply data augmentation, e.g., flip, crop, scale and rotate images, for tasks like image classification (Ciresan et al., 2012;Krizhevsky et al., 2012).Similarly, speech recognition systems made use of augmentation techniques like changing the tone and speed of the audio (Ko et al., 2015;Ragni et al., 2014), noise addition (Hartmann et al., 2016) and synthetic audio generation (Takahashi et al., 2016).Comparable techniques for data augmentation are less obvious for NLP tasks, due to structural differences among languages.There are only a small number of studies that tackle data augmentation techniques for NLP, such as Zhang et al. (2015) for text classification and Fadaee et al. (2017) for machine translation.
In this work, we focus on languages with small training datasets, that are made available by the Universal Dependency (UD) project.These languages are dominantly from Uralic, Turkic, Slavic and Baltic language families, which are known to have extensive morphological case-marking systems and relatively free word order.With these languages in mind, we propose an easily adaptable, multilingual text augmentation technique based on dependency trees, inspired from two common augmentation methods from image processing: cropping and rotating.As images are cropped to focus on a particular item, we crop the sentences to form other smaller, meaningful and focused sentences.As images are rotated around a center, we rotate the portable tree fragments around the root of the dependency tree to form a synthetic sentence.We augment the training sets of these low-resource languages via crop and rotate operations.In order to measure the impact of augmentation, we implement a unified characterlevel sequence tagging model.We systematically train separate parts-of-speech tagging models with the original and augmented training sets, and evaluate on the original test set.We show that crop and rotate provide improvements over the nonaugmented data for majority of the languages, especially for languages with rich case marking system.

Method
We borrow two fundamental label-preserving augmentation ideas from image processing: cropping and rotation.Image cropping can be defined as removal of some of the peripheral areas of an image to focus on the subject/object (e.g., focusing on the flower in a large green field).Following this basic idea, we aim to identify the parts of the sentence that we want to focus and remove the other chunks, i.e., form simpler/smaller meaningful sentences2 .In order to do so, we take advantage of dependency trees which provide us with links to focuses, such as subjects and objects.The idea is demonstrated in Fig. 1b on the Turkish sentence given in Fig. 1a.Here, given a predicate (wrote) that governs a subject (her father), an indirect object (to her) and a direct object (a letter); we form three smaller sentences with a focus on the subject (first row in Fig. 1b: her father wrote) and the objects (second and third row) by removing all dependency links other than the focus (with its subtree).Obviously, cropping may cause semantic shifts on a sentence-level.However it preserves local syntactic tags and even shallow semantic labels.
Images are rotated around a chosen center with a certain degree to enhance the training data.Similarly, we choose the root as the center of the sentence and rotate the flexible tree fragments around the root for augmentation.Flexible fragments are usually defined by the morphological typology of the language (Futrell et al., 2015).For instance, languages close to analytical typology such as English, rarely have inflectional morphemes.They do not mark the objects/subjects, therefore words have to follow a strict order.For such languages, sentence rotation would mostly introduce noise.On the other hand, large number of languages such as Latin, Greek, Persian, Romanian, Assyrian, Turkish, Finnish and Basque have no strict word order (though there is a preferred order) due to their extensive marking system.Hence, flexible parts are defined as marked fragments which are again, subjects and objects.Rotation is illustrated in Fig. 1c on the same sentence.In order to investigate the impact of the augmentation, we design a simple sequence tagging model that operates on the character level.Many low-resource languages we deal with in the Experiments section are morphologically rich.Therefore, we use a character-level model to address the rare word problem and to learn morphological regularities among words.
For each sentence s, we produce a label sequence l, where l t refers to POS tag for the t-th token.Given g as gold labels and θ as model parameters we find the values that minimize the negative log likelihood of the sequence: To calculate p(l t |θ, s), we first calculate a word embedding, w, for each word.We consider words as a sequence of characters c 0 , c 1 , .., c n and use a bi-LSTM unit to compose the character sequence into w, as in Ling et al. (2015): Later, these embeddings are passed onto another bi-LSTM unit: Hidden states from both directions are concatenated and mapped by a linear layer to the label space.Then label probabilities are calculated by a softmax function: Finally the label with the highest probability is assigned to the input.

Experiments and Results
We use the data provided by Universal Dependencies v2.1 (Nivre et al., 2017) project.Since our focus is on languages with low resources, we only consider the ones that have less than 120K tokens.
The languages without standard splits and sizes less than 5K tokens, are ignored.We use the universal POS tags defined by UD v2.1.
To keep our approach as language agnostic and simple as possible, we use the following universal dependency labels and their subtypes to extract the focus and the flexible fragment: NSUBJ (nominal subject), IOBJ (indirect object), OBJ (indirect object) and OBL (oblique nominal).These dependency labels are referred to as Label of Interest (LOI).The root/predicate may be a phrase rather than a single token.We use the following relations to identify such cases: FIXED, FLAT, COP (copula) and COMPOUND.Other labels such as ADVMOD can also be considered flexible, however are ignored for the sake of simplicity.We enumerate all flexible chunks and calculate all reordering permutations.Keeping the LOI limited is necessary to reduce the number of permutations.We apply reordering only to the first level in the tree.Our method overgeneralizes, to include sequences that are not grammatical in the language in question.We regard the ungrammatical sentences as noise.
Number of possible cropping operations are limited to the number of items that are linked via an LOI to the root.If we call it n, then the number of possible rotations would be (n + 1)! since n pieces and the root are flexible and can be placed anywhere in the sentence.To limit the number of rotations, we calculate all possible permutations for reordering the sentence and then randomly pick n of them.Each operation is applied with a certain probability p to each sentence, (e.g., if p = 1, n number of crops; if p = 0.5 an average of n/2 crops will be done).
We use the model in Sec. 2 to systematically train part-of-speech taggers on original and augmented training data sets.To be able measure the impact of the augmentation, all models are trained with the same hyperparameters.All tokens are lowercased and surrounded with special start-end symbols.Weight parameters are uniformly initialized between −0.1 and +0.1.We used one layer bi-LSTMs both for character composition and POS tagging with hidden size of 200.Character embedding size is chosen as 200.We used dropout, gradient clipping and early stopping to prevent overfitting for all experiments.Stochastic gradient descent with an initial learning rate as 1 is used as the optimizer.Learning rate is reduced by half if scores on development set do not improve.
Average of multiple runs for 20 languages are given in Fig. 1.Here, Org column refers to our baseline with non-augmented, original training set, where Imp% is the improvement over the baseline by the best crop/flip model for that language.It is evident that, with some minor exceptions, all languages have benefited from a type of augmentation.We see that the biggest improvements are achieved on Irish and Lithuanian, the ones with the lowest baseline scores and the smallest training sets3 .Our result on both languages show that both operations reduced the generalization error surprisingly well in the lack of training data.
Tagging results depend on many factors such as the training data size, the source of the treebank (e.g., news may have less objects and subjects compared to a story), and the language typology (e.g., number/type of case markers it uses).In Fig. 2, the relation between the data size and the improvement by the augmentation is shown.Pearson correlation coefficient for two variables is calculated as −0.35.
Indo-European (IE) : Baltic and Slavic languages are known to have around 7 distinct case markers, which relaxes the word order.As expectedly, both augmentation techniques improve the scores for Baltic (Latvian, Lithuanian) and Slavic (Belarusian, Slovak, Serbian, Ukranian) OCS is solely compiled from bible text which is known to contain longer and passive sentences.We observe that rotation performs slightly better than cropping for Slavic languages.In the presence of a rich marking system, rotation can be considered a better augmenter, since it greatly increases the variance of the training data by shuffling.For Germanic (Gothic, Afrikaans, Danish) languages, we do not observe a repeating gain, due to lack of necessary markers.
Uralic and Turkic : Both language types have an extensive marking system.Hence, similar to Slavic languages, both techniques improve the score.
Dravidian : Case system of modern Tamil defines 8 distinct markers, which explains the improved accuracies of the augmented models.We would expect a similar result for Telugu.However Telugu treebank is entirely composed of sentences from a grammar book which may not be expressive and diverse.

Related Work
Similar to sentence cropping, Vickrey and Koller (2008) define transformation rules to simplify sentences (e.g., I was not given a chance to eat -I ate) and shows that enrichening training set with simplified sentences improves the results of semantic role labeling.One of the first studies in text augmentation (Zhang et al., 2015), replaces a randomly chosen word with its randomly chosen synonym extracted from a thesaurus.They report improved test scores when a large neural model is trained with the augmented set.Jia and Liang (2016)

Discussion
Unlike majority of previous NLP augmentation techniques, the proposed methods are meaningpreserving, i.e., they preserve the fundamental meaning of the sentence for most of the tested languages.Therefore can be used for variety of problems such as semantic role labeling, sentiment analysis, text classification.Instead of those problems, we evaluate the idea on the simplest possible task (POS) for the following reasons: • It gets harder to measure the impact of the idea as the system/task gets complicated due to large number of parameters.• POS tagging performance is a good indicator of performances of other structured prediction tasks, since POS tags are crucial features for higher-level NLP tasks.
Our research interest was to observe which augmentation technique would improve which language, rather than finding one good model.Therefore we have not used development sets to choose one good augmentation model.

Conclusion and Future Work
Neural models have become a standard approach for many NLP problems due to their ability to extract high-level features and generalization capability.Although they have achieved state-ofthe-art results in NLP benchmarks with languages with large amount of training data, low-resource languages have not yet benefited from neural models.In this work, we presented two simple text augmentation techniques using dependency trees inspired by image cropping and rotating.We evaluated their impact on parts-of-speech tagging in a number of low-resource languages from various language families.Our results show that: • Language families with rich case marking systems (e.g., Baltic, Slavic, Uralic) benefit both from cropping and rotation.However, for such languages, rotation increases the variance of the data relatively more, leading to slightly better accuracies.
• Both techniques provide substantial improvements over the baseline (non-augmented data) when only a tiny training dataset is available.
This work aimed to measure the impact of the basic techniques, rather than creating the best text augmentation method.Following these encouraging results, method can be improved by (1) considering the preferred chunk order of the language during rotation, (2) taking languagespecific flexibilities into account (e.g., Spanish typically allows free subject inversion (unlike object)).Furthermore, we plan to extend this work by evaluating the augmentation on other NLP benchmarks such as language modeling, dependency parsing and semantic role labeling.The code is available at https://github.com/gozdesahin/crop-rotate-augment.

Figure 1 :
Figure1: Demonstration of augmentation ideas on the Turkish sentence "Babası ona bir mektup yazdı" (Her father wrote her a letter).S: Subject, V: Verb, O:Object, IO: Indirect Object.Arrows are drawn from dependent to head.Both methods are applied to the Labels of Interest (LOI).

Figure 2 :
Figure 2: Treebank size versus gain by augmentation

Table 1 :
POS tagging accuracies on UDv2.1 test sets.Best scores are shown with bold.Org: Original.p: operation probability.Imp%: Improvement over original (Org) by the best model trained with the augmented data.
languages, except for Old Church Slavic (OCS).