Training Data Augmentation for Low-Resource Morphological Inflection

This work describes the UoE-LMU submission for the CoNLL-SIGMORPHON 2017 Shared Task on Universal Morphological Reinflection, Subtask 1: given a lemma and target morphological tags, generate the target inflected form. We evaluate several ways to improve performance in the 1000-example setting: three methods to augment the training data with identical input-output pairs (i.e., autoencoding), a heuristic approach to identify likely pairs of inflectional variants from an unlabeled corpus, and a method for cross-lingual knowledge transfer. We find that autoencoding random strings works surprisingly well, outperformed only slightly by autoencoding words from an unlabeled corpus. The random string method also works well in the 10,000-example setting despite not being tuned for it. Among 18 submissions our system takes 1st and 6th place in the 10k and 1k settings, respectively.


Introduction
Morphological variation is a major contributor to the sparse data problem in NLP, especially for under-resourced languages. The SIGMORPHON 2016 Shared Task (Cotterell et al., 2016) and the CoNLL-SIGMORPHON 2017 Shared Task (Cotterell et al., 2017) aimed to inspire researchers to develop better systems for morphological inflection across a wide range of languages with varying training resources. In 2016, when over 10,000 training examples were provided for each language, neural network-based systems performed considerably better than other approaches (Cotterell et al., 2016). The 2017 Shared Task therefore included settings with different amounts of training data (100, 1000, or 10,000 examples), to examine which approaches might work best when data is even more limited.
We focus on the 1000-example setting of Subtask 1: given a lemma with its part of speech and target morphological tags, generate the target inflected form. Our baseline system is the winning system from 2016, the morphological encoder-decoder (MED) from LMU (Kann and Schütze, 2016). We explore methods for improving its performance in the face of fewer training examples, either with or without additional data from unlabeled corpora. In particular, we focus on various training data augmentation methods. One uses a heuristic approach to identify likely pairs of inflectional variants from an unlabeled corpus in an unsupervised way, and uses these as input-output pairs. Three other methods augment the training data with identical input-output pairs, i.e., they simultaneously train the network to perform autoencoding. We compare three types of autoencoder inputs: lemmas and inflected forms from the training data, words from an unlabeled corpus, or random strings.
We present detailed results and comparisons for various amounts of added training data for all 52 languages of the shared task. We find that all methods improve considerably over the MED baseline (7.2-10.7% across all languages; a 15.5% improvement over the shared task baseline). Most of the benefit comes with only 4k extra examples, but performance continues to improve up to 16k added examples.
After controlling for the amount of additional data, we see only a small benefit from autoencoding corpus words rather than random strings. Using hypothesized morphological variants works as well as random strings. These results suggest that the main advantage of all these methods is providing a strong bias towards learning the identity transformation and/or working as regularizers, rather than learning language-specific phonotactics or morpho-phonological changes. Finally, following prior work, we test whether cross-lingual knowledge transfer can help, by multilingual training of joint models for groups of up to 10 related languages. However, we find no improvement as compared to using an equivalent amount of random-string autoencoder examples.
Our final submission to the shared task consists of two submissions for Subtask 1 (Inflection): the random string autoencoder for the medium and high data settings of the restricted (main) track and the corpus word autoencoder for the medium data setting of the bonus track (track with external monolingual corpora). All systems use 16K autoencoder examples and an ensemble of three training runs with majority voting.
In the high resource setting of the restricted track, our system outperforms all 17 other submissions, with an average test set accuracy of 95.32% over 52 languages. In the medium resource setting, among 18 submissions our system takes the 6th place with 81.02% (1.78% below the top system). In the medium resource setting of the bonus track, among 2 submissions our system comes first with 82.37%.

Baseline MED System
For our baseline system (henceforth MED baseline or MED), we use the sequence encoder-decoder architecture and input/output format of the 2016 Shared Task winner (Kann and Schütze, 2016). The architecture follows Bahdanau et al. (2015): that is, the encoder is a bidirectional gated recurrent neural network (GRU) with attention, and the decoder is a uni-directional GRU. For details, see Kann and Schütze (2016) and Bahdanau et al. (2015).
The input sequence consists of the space-separated characters of the input lemma (the dictionary form of the word), followed by space-separated morphological tags, each prepended with a tag marker, for example:
    w a l k t=V t=V.PTCP t=PRS    (1)
The target output is a sequence of characters forming the inflected word, e.g., w a l k i n g.
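As an illustration of this format, the following minimal Python sketch builds the source and target sequences for one training example. The function name format_inflection_example is hypothetical, and the sketch assumes that lemmas and forms contain no internal spaces; it is not the original MED preprocessing code.

```python
def format_inflection_example(lemma, tags, form):
    """Return (source, target) strings in the character-level format of Example 1.

    Assumption: neither the lemma nor the form contains spaces.
    """
    # Characters of the lemma, then each tag prefixed with the "t=" marker.
    source_tokens = list(lemma) + [f"t={tag}" for tag in tags]
    # The target is simply the inflected form split into characters.
    target_tokens = list(form)
    return " ".join(source_tokens), " ".join(target_tokens)


source, target = format_inflection_example("walk", ["V", "V.PTCP", "PRS"], "walking")
print(source)  # w a l k t=V t=V.PTCP t=PRS
print(target)  # w a l k i n g
```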

Augmenting the Training Data
For the Shared Task, the main competition permits no additional resources beyond the labeled training examples given for each setting. However, Wikipedia dumps in each language are provided for teams who wish to explore semi-supervised methods as well. We examine two methods that use only the training data, and two that also incorporate corpus data. Finally, we explore a method that uses multilingual resources.

No Outside Resources (AE-TD, AE-RS)
Morphologically related words are typically similar in form, and in many cases, parts of the word are copied from the lemma to the inflected form.
As suggested in prior work, we hypothesized that training MED to copy strings as a secondary task would help with the morphological inflection task. That is, we train the model simultaneously on the tasks of morphological reinflection and sequence autoencoding, interspersing inflection training examples and autoencoding examples. This can be viewed as a form of multitask learning, and is equivalent to maximizing the log-likelihood for both tasks:
$$\mathcal{L}(\theta) = \sum_{(l,t,w) \in D} \log p_\theta\big(w \mid e(l, t)\big) + \sum_{s \in S} \log p_\theta\big(s \mid e(s)\big)$$
where D is the labeled training data, with each example consisting of a lemma l, a morphological tag t and an inflected form w, and S is a set of autoencoding examples. The function e represents the encoder, which depends on θ.
In the setting with no outside resources we experiment with two variants of the sequence autoencoder. The first of these, AE-TD, uses the lemmas and target forms in the training data as inputs to the autoencoder, yielding up to twice as many autoencoder inputs as inflection training pairs (any duplicate lemmas or target forms are included only once).
Our second autoencoder variant, AE-RS, uses randomly generated strings as inputs, which means we can produce an arbitrary number of autoencoding examples. In this and following systems, we use the postfix XXK (e.g. 1K, 2K, 4K) to indicate the number of additional examples generated. To obtain each example, we first choose its length uniformly at random from the interval [4,12] and then sample each character uniformly at random from the alphabet of the respective language.
Input/Output Format We generally follow the input representation outlined in §2, except that morphological tags are replaced with one special tag that stands for autoencoding. The output format does not change.
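The random-string sampling and the autoencoding format can be sketched as follows. This is a minimal illustration under stated assumptions: the tag string "t=COPY" is a hypothetical stand-in for the special autoencoding tag (the text does not give its actual name), and the function names are invented for this sketch.

```python
import random

# Hypothetical name for the special autoencoding tag; the source does not specify it.
AUTOENCODE_TAG = "t=COPY"


def random_string(alphabet, rng, min_len=4, max_len=12):
    """Sample a length uniformly from [min_len, max_len], then each character uniformly."""
    length = rng.randint(min_len, max_len)  # randint includes both endpoints
    return "".join(rng.choice(alphabet) for _ in range(length))


def format_autoencoding_example(string):
    """Source is the characters plus the autoencoding tag; target is the same characters."""
    chars = " ".join(string)
    return f"{chars} {AUTOENCODE_TAG}", chars


def make_random_examples(alphabet, n, seed=0):
    """Generate n random-string autoencoding examples for one language."""
    rng = random.Random(seed)
    letters = sorted(alphabet)  # fixed order so that a given seed is reproducible
    return [format_autoencoding_example(random_string(letters, rng)) for _ in range(n)]


examples = make_random_examples(set("abcdefghijklmnopqrstuvwxyz"), n=4000)
```

With this helper, the AE-RS-4K setting would correspond to n=4000 extra examples appended to the inflection training data.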

Corpus Word Autoencoder (AE-CW)
In the setting where corpus data is available we can replace randomly generated strings with corpus words (system AE-CW). We hypothesized that autoencoding words from the actual language would give not only the benefit of learning to copy, but also the benefit of learning the character distributions typical of a given language.
To select words for the corpus word autoencoding task we use Wikipedia text dumps provided for the shared task. We filter out all words shorter than 4 characters. For each language we learn its alphabet from the letters that occur in training words of the original SIGMORPHON training sets and filter out all words that contain foreign characters. We use the remaining words to sample uniformly without repetition the required amount of autoencoding examples. The input and output formats are the same as for the previous approaches.
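The selection procedure above can be sketched in a few lines of Python. The sketch assumes that sampling is done over distinct word types and that the corpus has already been tokenized; neither detail is stated in the text, and the function names are hypothetical.

```python
import random


def learn_alphabet(training_words):
    """Alphabet = every character seen in the lemmas and forms of the training set."""
    return {ch for word in training_words for ch in word}


def candidate_words(corpus_tokens, alphabet, min_len=4):
    """Keep word types of at least min_len characters that use only in-alphabet characters."""
    kept = {w for w in corpus_tokens if len(w) >= min_len and set(w) <= alphabet}
    return sorted(kept)  # sorted so that sampling is reproducible for a given seed


def sample_corpus_words(corpus_tokens, training_words, n, seed=0):
    """Sample up to n distinct corpus words uniformly, without repetition."""
    pool = candidate_words(corpus_tokens, learn_alphabet(training_words))
    return random.Random(seed).sample(pool, min(n, len(pool)))


training = ["walk", "walking", "talked"]
corpus = "the king talked and walked to a lake in Köln lake".split()
print(candidate_words(corpus, learn_alphabet(training)))
# ['king', 'lake', 'talked', 'walked']
```

Here "Köln" is rejected because "K" and "ö" never occur in the toy training words, and the short words are removed by the length filter.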

Data Mining for Inflected Pairs (DM)
Our next method, DM, mines the corpus data to create new training examples by (a) inferring new lemma-inflected form pairs and (b) predicting the tags of the inflected forms. We describe each step below.
Inferring Lemma-Word Form Pairs Although most work on unsupervised learning of morphology has focused on decomposing words into morphemes, their constituent parts (e.g., Kurimo et al., 2010; Hammarström and Borin, 2011), others have focused on finding morphologically related words and the orthographic patterns relating them (Schone and Jurafsky, 2000; Baroni et al., 2002; Neuvel and Fulop, 2002; Soricut and Och, 2015). We adopt the algorithm by Neuvel and Fulop (2002) to learn Word Formation Strategies (WFS), which are frequently occurring orthographic patterns that relate whole words to other whole words. The input of this algorithm is a list of N words. The algorithm works by comparing each of the N words to all other words. It first finds word similarities as the Longest Common Subsequence (LCS) between the two words. Then it finds word differences as the orthographic differences with respect to similarities (see Table 2 for examples). Finally, all word pairs with the same differences have their similarities and differences merged into one WFS. For example, words in Table 2 sanction the following WFS:
    *##ceive ⇔ *##ception    (4)
where * and # stand for the optional and mandatory character wild cards respectively. The interpretation of the WFS in Example 4 is: "a word that ends with ceive preceded by 2 to 3 characters predicts another word ending with ception preceded by the same 2 to 3 characters" (Neuvel and Fulop, 2002). Table 1 gives English WFS examples and sample word pairs that warranted their creation.
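The first two steps (similarities via LCS, then differences relative to the similarities) and the grouping of pairs by shared differences can be sketched as below. This is a simplified illustration, not the Neuvel and Fulop (2002) algorithm itself: each run of shared characters is collapsed to a single "_" placeholder instead of being generalized into * and # wild cards, words are assumed not to contain "_", and the min_similarity threshold is a hypothetical parameter.

```python
from collections import defaultdict
from itertools import combinations


def lcs_matches(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def difference_pattern(word, matched):
    """Keep unmatched characters; collapse each run of matched characters to '_'."""
    out = []
    for idx, ch in enumerate(word):
        if idx not in matched:
            out.append(ch)
        elif not out or out[-1] != "_":
            out.append("_")
    return "".join(out)


def compare(a, b):
    """Return (similarities, differences of a, differences of b)."""
    pairs = lcs_matches(a, b)
    similarity = "".join(a[i] for i, _ in pairs)
    diff_a = difference_pattern(a, {i for i, _ in pairs})
    diff_b = difference_pattern(b, {j for _, j in pairs})
    return similarity, diff_a, diff_b


def group_by_differences(words, min_similarity=3):
    """Compare every word with every other word and group pairs by their differences."""
    groups = defaultdict(list)
    for a, b in combinations(sorted(set(words)), 2):
        similarity, diff_a, diff_b = compare(a, b)
        if len(similarity) >= min_similarity:
            groups[(diff_a, diff_b)].append((a, b))
    return dict(groups)


print(compare("walked", "walking"))  # ('walk', '_ed', '_ing')
```

Comparing all pairs is quadratic in the number of words, which is why the size N of the input word list matters in practice.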

Tag Prediction
To perform labeling we make use of two resources: (i) a word embedding based part of speech (POS) classifier; (ii) the WFS we learned previously.
Each of the original shared task training examples contains two word forms and a set of morphological tags including word POS information (see Example 1). We can use words with POS labels and word embeddings to train a POS classifier. Namely, we train a support vector machine (SVM) (Cortes and Vapnik, 1995) to predict POS labels using word embeddings as features. We train word embeddings on the Wikipedia text dumps provided for the task. We use FastText by Bojanowski et al. (2016) to train 300-dimensional continuous bag-of-words embeddings (Mikolov et al., 2013) for all words occurring at least 5 times.
For each training example in the SIGMORPHON training data we examine each of the previously learned WFS. If the training example fits the WFS's orthographic constraints, we examine all word pairs that warranted the creation of this WFS. For a word pair to be labeled with the same morphological tags as the training example we require that both words are classified with the same part of speech as the original training example.
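The labeling step can be sketched as follows. Several details here are assumptions made for illustration: a WFS is represented as a (left pattern, right pattern, word pairs) triple, "fitting the orthographic constraints" is read as the lemma matching the left pattern and the inflected form matching the right pattern with identical wild-card material, the POS is taken to be the first tag of the example, and predict_pos is a stand-in for the embedding-based SVM classifier.

```python
import re


def wfs_to_regex(pattern):
    """Translate a WFS side such as '*##ceive' into a regex.

    '*' is an optional character, '#' a mandatory one; each run of wild cards
    becomes one capture group so the matched material can be compared.
    """
    parts, in_wild = [], False
    for ch in pattern:
        if ch in "*#":
            if not in_wild:
                parts.append("(")
                in_wild = True
            parts.append(".?" if ch == "*" else ".")
        else:
            if in_wild:
                parts.append(")")
                in_wild = False
            parts.append(re.escape(ch))
    if in_wild:
        parts.append(")")
    return re.compile("".join(parts))


def fits(left, right, word_a, word_b):
    """True if both words match their WFS side with the same wild-card material."""
    match_a = wfs_to_regex(left).fullmatch(word_a)
    match_b = wfs_to_regex(right).fullmatch(word_b)
    return bool(match_a and match_b and match_a.groups() == match_b.groups())


def propagate_tags(example, wfs, predict_pos):
    """Copy the tags of a training example to WFS word pairs with a matching POS."""
    lemma, tags, form = example
    left, right, pairs = wfs
    if not fits(left, right, lemma, form):
        return []
    pos = tags[0]  # assumption: the first tag carries the part of speech
    return [(a, tags, b) for a, b in pairs
            if predict_pos(a) == pos and predict_pos(b) == pos]


print(fits("*##ceive", "*##ception", "receive", "reception"))  # True
```

In this reading, a pair is discarded as soon as either of its words is classified with a different part of speech than the training example.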
Input/Output Format To encode the additional training examples we use the same format as for the original ones (see §2) except we add a tag which signals that this example has been automatically extracted. We do this as a measure of caution to avoid ambiguities introduced by potentially erroneous training examples.

Using Multilingual Resources (MLT)
Recently, it was shown that training on a high-resource language can improve morphological inflection on a related low-resource language using an encoder-decoder system like the one here. This can be done using the same network as above, but training on inflection examples from multiple different languages. This is yet another form of multitask learning, since the model parameters are shared between languages. Following that work, our multilingually trained system (MLT) uses the same input format as above, but with an additional tag prepended to each example indicating which language it is from.
Our setup is slightly different from that of the earlier work, which had only 50 or 200 examples in each target language and used only one or two other higher-resource languages for transfer. Here, we have similar amounts of data for each language (1000 examples in most cases), and use larger groups of related languages to train together (see §4).

Experiments
Datasets The official shared task data consists of sets for 52 different languages, of which 40 were released as development languages. We used these for our preliminary experiments to compare different systems and quantities of additional training data. The remaining 12 "surprise" languages were released shortly before the test phase of the shared task, and we report results for our best systems on these as well. In the "medium" training data setting, which we focus on in this work, 1000 training instances are given for each language, except for Scottish Gaelic with only 681 instances. Additionally, development sets with 1000 instances are available for all languages except for Basque, Bengali, Haida, and Welsh with 100 instances each and Scottish Gaelic with 50.

MED Parameters
In our experiments we use the same training method and hyperparameter settings as suggested by Kann and Schütze (2016). Namely, we use 100 hidden units for the encoder and decoder GRUs, and 300 dimensions for the encoder and decoder embeddings. For training we use stochastic gradient descent with Adadelta (Zeiler, 2012), a gradient clipping threshold of 1.0, and a minibatch size of 20. When making predictions we use beam-search decoding with a beam of size 12.
Baselines We compare our results with two baselines: the SIGMORPHON baseline and the MED baseline (see §2).
To investigate the quality of the mined training examples we conduct an experiment where for each morphological tag in the SIGMORPHON training data we pick an alternative lemma and target form among the mined examples with the same tag. We hypothesize that, if correct, mined examples should serve the same function as the original ones and the average performance should not change. On average this resulted in 500 swaps per language. The average MED performance on the modified training sets is about 10% lower than on the original training files.

Additional Training Data
This suggests that the mined examples, although noisy, work as model regularizers when annotated with an additional "mining" tag, thus benefiting MED's performance on the inflection task.
Obtaining autoencoding examples, however, is computationally simpler than mining additional training examples. Given the similar effects, the autoencoding approach therefore seems preferable to data mining.

Best Monolingual Systems
The average development set accuracy over 50 languages of our best system, AE-CW-16K, is 81.2% (see Table 4), an absolute gain of about 15.5% and 10.6% over the SIGMORPHON and MED baselines, respectively. AE-TD on average performs only 3.5% worse than AE-CW-16K despite using 87.5% fewer autoencoding examples and no external resources. Figure 2 shows the accuracy of our best systems on all languages. We report the average accuracy and standard error across three separate training runs for each language. We conducted a paired-samples t-test to compare mean development set accuracies for 50 languages between the AE-CW-16K and AE-RS-16K systems, using 3 runs each. The test indicated a significant difference between the accuracies of AE-CW-16K and AE-RS-16K (t(49) = 4.04, p < 0.01), although the difference is small relative to the gains over the baselines.
Table 3 shows results for the multilingual training experiments. Due to the different development set sizes (see §4) we report weighted average development set accuracies. MLT without any additional data is better than the MED baseline for most groups of related languages except the Semitic languages, on average giving about 7% improvement over the MED baseline. MLT-AE-TD, the system in which the original training data is used to obtain autoencoding training examples, on average outperforms the conservative AE-TD baseline by 2.5% absolute, but barely reaches the performance of the AE-TD-RS-XK baseline. The AE-TD-RS-XK baseline (for details see §4) gives the performance for each individual language when all training examples of other languages in an MLT-AE-TD training file are replaced by random string autoencoding examples. Table 3 shows that on average MLT-AE-TD works no better than AE-TD-RS-XK. Currently, it is unclear whether the performance gains in the MLT and MLT-AE-TD experiments are due to knowledge transfer from related languages, as suggested in prior work, or because different languages serve as model regularizers with respect to each other.
Performance on 6 out of 8 groups of related languages, however, suggests that random string autoencoding is not only simpler but also a better performing method than multilingual training.
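For reference, a weighted average accuracy of the kind reported for the multilingual experiments can be computed as in the sketch below. It assumes that each language is weighted by the number of instances in its development set, which the text does not state explicitly; the numbers in the usage line are illustrative only and are not results from this work.

```python
def weighted_average_accuracy(results):
    """Average per-language accuracies, weighting each by its development set size.

    results: list of (accuracy, dev_set_size) pairs, one per language.
    """
    total = sum(size for _, size in results)
    if total == 0:
        raise ValueError("development sets are empty")
    return sum(accuracy * size for accuracy, size in results) / total


# Illustrative values only: one language with 1000 dev instances, one with 100.
print(weighted_average_accuracy([(0.5, 1000), (1.0, 100)]))
```

Under this weighting, languages with small development sets (such as those with only 50 or 100 instances) contribute proportionally less to the group average.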

Shared Task Submission: Test Results
We submitted two final systems to the medium-resource track of Subtask 1 (Inflection): AE-RS-16K for the restricted (main) track, and AE-CW-16K for the bonus track (where external monolingual corpora are permitted). The final systems are ensembles of 3 separate training runs, and the final answer is selected by majority voting (or chosen at random in case of a tie).
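The ensembling rule can be sketched as follows: given the predictions of the separate training runs for one input, return the most frequent form and break ties at random. The function name and the optional rng argument are hypothetical conveniences of this sketch.

```python
import random
from collections import Counter


def majority_vote(predictions, rng=random):
    """Return the most frequent prediction; choose at random among tied forms."""
    counts = Counter(predictions)
    best = max(counts.values())
    tied = sorted(form for form, count in counts.items() if count == best)
    return tied[0] if len(tied) == 1 else rng.choice(tied)


print(majority_vote(["walked", "walked", "walkd"]))  # walked
```

With three runs, a tie can only occur when all three runs disagree, in which case one of the three outputs is picked at random.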
Although we did not tune any system to the high-resource track, we also submitted results there, using an analogous ensemble system with 16K random-string autoencoder examples in addition to the 10k training examples. We did not submit to the low-resource track because the results from the AE-RS-16K method were below the baseline system on the development set.
In the high resource setting of the restricted track, our system achieved an average test set accuracy of 95.32%, which is a 17.51% absolute improvement over the Shared Task baseline, and the top performance among 18 submissions. In the medium resource setting AE-RS-16K gives 81.02%, an improvement of 16.32% absolute over the Shared Task baseline. This result, however, is 1.78% absolute lower than the best system's performance, so among 18 submissions AE-RS-16K takes 6th place. In the medium resource setting of the bonus track, among 2 submissions AE-CW-16K comes first with 82.37%. This is 1.35% better than our restricted track submission (AE-RS-16K), but still 0.43% worse than the top performing system in the restricted track.

Conclusion
We evaluated several ways to improve the morphological inflection performance of a state-of-the-art encoder-decoder model (MED) when relatively few labeled examples are available. In experiments on 52 languages, we showed that all methods considerably outperformed the MED baseline. Autoencoding corpus words (AE-CW) gave the largest improvement, but was only slightly better than autoencoding random strings (AE-RS). We found no benefit from cross-lingual knowledge transfer as compared to using an equivalent number of random string autoencoder examples.
Our results suggest that the main benefit of the various data augmentation methods is providing a strong bias towards learning the identity transformation and/or regularizing the model, with a slight additional benefit obtained by learning the typical character sequences in the language. These benefits can be achieved with very simple methods and few or no additional data resources.