Unlabeled Data for Morphological Generation With Character-Based Sequence-to-Sequence Models

We present a semi-supervised way of training a character-based encoder-decoder recurrent neural network for morphological reinflection—the task of generating one inflected wordform from another. This is achieved by using unlabeled tokens or random strings as training data for an autoencoding task, adapting a network for morphological reinflection, and performing multi-task training. We thus use limited labeled data more effectively, obtaining up to 9.92% improvement over state-of-the-art baselines for 8 different languages.


Introduction
Morphologically rich languages use inflection, the adaptation of a surface form to its syntactic context, to mark the properties of a word, e.g., gender or number of nouns or tense of verbs. This drastically increases the type-token ratio and thus negatively affects natural language processing (NLP), making morphological analysis and generation an important field of research.
In this work, we focus on morphological reinflection (MRI), the task of mapping one inflected form of a lemma to another, given the morphological properties of the target, e.g., (smiling, Past-Part) → smiled. The lemma does not have to be known. Recently, there have been advances on the topic, motivated by the SIGMORPHON 2016 shared task on morphological reinflection (Cotterell et al., 2016) and the CoNLL-SIGMORPHON 2017 shared task on universal morphological reinflection (Cotterell et al., 2017). In 2016, neural sequence-to-sequence models, specifically attention-based encoder-decoder models, outperformed all other approaches by a wide margin (Faruqui et al., 2016; Kann and Schütze, 2016). However, those models require a lot of training data, whereas many morphologically rich languages are low-resource, and little work has been done so far on neural models for morphology in settings with limited training data. This makes sequence-to-sequence models inapplicable to morphological generation in most languages.
An abundance of unlabeled data, in contrast, can be assumed to be available for each language in the focus of NLP. Thus, we propose a semi-supervised training method for a state-of-the-art encoder-decoder network for MRI using both labeled and unlabeled data, mitigating the need for expensive annotations. We achieve this by treating unlabeled words as training examples for an autoencoding (Vincent et al., 2010) task and performing multi-task training (cf. Figure 1). We see the following reasons why this should be beneficial: (i) the decoder's character language model can be trained using unlabeled data; (ii) training on a second task reduces the problem of overfitting; (iii) by forcing the model to additionally learn autoencoding, we give it a strong prior to copy the input string. The latter might be advantageous, as many forms of a paradigm often share the same stem, e.g., smiling and smiled. In order to investigate the importance of this copying bias, we further experiment with autoencoding of random strings and find that, for our experimental settings and non-templatic languages, the performance gain is comparable to using corpus words.

Model Description
The log-likelihood for joint training on the tasks of MRI and autoencoding is:

$$\mathcal{L}(\theta) = \sum_{(f_s, t, f_t) \in T} \log p_\theta\big(f_t \mid e_\theta(f_s, t)\big) + \sum_{w \in W} \log p_\theta\big(w \mid e_\theta(w)\big) \quad (1)$$

$T$ is the MRI training data, with each example consisting of a source form $f_s$, a target form $f_t$ and a target tag $t$. $W$ denotes a set of words in the language of the system. The encoding function $e_\theta$ depends on the parameters $\theta$, which are shared across the two tasks, so that information is shared between them. We achieve this by giving our model data from both sets at the same time and marking each example with a task-specific input symbol, cf. Figure 1.
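The mixing of the two training sets can be sketched as below. This is a minimal illustration, not the authors' code; the function name, the tuple layout and the task markers `"MRI"`/`"AUTOENC"` are hypothetical:

```python
import random

def build_joint_training_set(labeled, unlabeled, seed=0):
    """Merge MRI examples and unlabeled words into one training stream.

    labeled:   list of (source_form, target_tag, target_form) triples (the set T)
    unlabeled: list of corpus words (the set W)
    Each example carries a task marker, so one shared-parameter model
    can be trained on both objectives at the same time.
    """
    examples = []
    for f_s, tag, f_t in labeled:
        examples.append((("MRI", tag, f_s), f_t))   # predict the inflected form
    for w in unlabeled:
        examples.append((("AUTOENC", None, w), w))  # reconstruct the input word
    random.Random(seed).shuffle(examples)
    return examples
```

Because the autoencoding targets are simply the inputs themselves, unlabeled corpus tokens require no annotation effort at all.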
Encoder. For the input of the encoder, we adapt the format by Kann and Schütze (2016), but modify it to be able to handle unlabeled data: Given the set of morphological subtags M each target tag is composed of (e.g., the tag 1SgPresInd contains the subtags 1, Sg, Pres and Ind), and the alphabet Σ of the language of application, our input is of the form B[A/M*]Σ*E, i.e., it consists of either a sequence of subtags or the symbol A signaling that the input is not annotated and should be autoencoded, and (in both cases) the character sequence of the input word. B and E are start and end symbols. Each part of the input is represented by an embedding. We then encode the input $x = x_1, x_2, \ldots, x_{T_x}$ using a bidirectional gated recurrent neural network (GRU) (Cho et al., 2014b), with hidden states $\overrightarrow{h}_i = f(\overrightarrow{h}_{i-1}, x_i)$ and $\overleftarrow{h}_i = f(\overleftarrow{h}_{i+1}, x_i)$, where $f$ is the update function of the hidden layer. Forward and backward hidden states are concatenated to obtain the input $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$ for the decoder.
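The input format B[A/M*]Σ*E can be constructed as follows. This is a sketch under stated assumptions: the exact symbol spellings (`<B>`, `<E>`, `<A>`, bracketed subtags) are illustrative, not the authors' actual vocabulary:

```python
def encode_input(word, subtags=None):
    """Build the encoder input sequence B [A / M*] Sigma* E.

    For a labeled MRI example, `subtags` is the list of morphological
    subtags the target tag is composed of (e.g. ["1", "Sg", "Pres", "Ind"]);
    for an unlabeled word it is None, and the symbol <A> is emitted
    instead, signaling that the word should be autoencoded.
    """
    prefix = ["<A>"] if subtags is None else [f"<{t}>" for t in subtags]
    return ["<B>"] + prefix + list(word) + ["<E>"]
```

Each resulting symbol, whether subtag, character, or special marker, is then looked up in a single shared embedding table before being fed to the bidirectional GRU.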
Decoder. The decoder is an attention-based GRU, defining a probability distribution over strings in $\Sigma^*$:

$$p(y \mid x) = \prod_{t=1}^{T_y} g(y_{t-1}, s_t, c_t)$$

with $s_t$ being the decoder hidden state for time $t$ and $c_t$ being a context vector, calculated from the encoder hidden states together with attention weights. A detailed description of the model can be found in Bahdanau et al. (2015).
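One attention step, computing the context vector $c_t$ from the encoder states, can be sketched as below. Note that Bahdanau et al. (2015) use an MLP alignment function; a bilinear score is substituted here purely to keep the sketch short, and all names are illustrative:

```python
import numpy as np

def attention_context(decoder_state, encoder_states, W_a):
    """Simplified attention step (bilinear scoring instead of an MLP).

    scores_i = h_i . (W_a s_t); alpha = softmax(scores); c_t = sum_i alpha_i h_i
    encoder_states: (T_x, enc_dim), decoder_state: (dec_dim,), W_a: (enc_dim, dec_dim)
    """
    scores = encoder_states @ (W_a @ decoder_state)  # (T_x,)
    scores -= scores.max()                           # numerical stability
    alphas = np.exp(scores) / np.exp(scores).sum()   # attention weights
    context = alphas @ encoder_states                # (enc_dim,)
    return context, alphas
```

The weights `alphas` let the decoder attend to individual input characters, which is what makes copying the stem from input to output easy to learn.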

Experiments
Dataset. We experiment on the task 3 dataset of the SIGMORPHON 2016 shared task on MRI (Cotterell et al., 2016) and all standard languages provided: Arabic, Finnish, Georgian, German, Navajo, Russian, Spanish and Turkish. German, Spanish and Russian are suffixing and exhibit stem changes. Russian differs from the other two in that those stem changes are consonantal and not vocalic. Finnish and Turkish are agglutinating, almost exclusively suffixing and have vowel harmony systems. Georgian uses both prefixation and suffixation. In contrast, Navajo mainly makes use of prefixes, with consonant harmony among its sibilants. Finally, Arabic is a templatic, non-concatenative language.
For each language, we further add randomly sampled words from the respective Wikipedia dumps. We exclude tokens that are not exclusively composed of characters of the language's alphabet, e.g., tokens containing digits, or that do not appear at least twice in the corpus. The exact amount of unlabeled data added is treated as a hyperparameter depending on the number of available annotated examples and optimized on the development set, cf. Section 4.1. Evaluation is done on the official shared task test set.
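The filtering and sampling of unlabeled corpus words can be sketched as follows. This is an illustration of the described criteria, not the authors' preprocessing script; the function name and signature are assumptions:

```python
import random
from collections import Counter

def sample_unlabeled(tokens, alphabet, n, min_count=2, seed=0):
    """Sample n unlabeled training words from a corpus token stream.

    Tokens are excluded if they contain characters outside the
    language's alphabet (e.g. digits) or occur fewer than
    min_count times in the corpus.
    """
    counts = Counter(tokens)
    candidates = [w for w, c in counts.items()
                  if c >= min_count and set(w) <= set(alphabet)]
    rng = random.Random(seed)
    return rng.sample(candidates, min(n, len(candidates)))
```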
Training, hyperparameters and evaluation. We mainly adopt the hyperparameters of Kann and Schütze (2016). Embeddings are 300-dimensional, the size of all hidden layers is 100, and for training we use ADADELTA (Zeiler, 2012) with a batch size of 20. We train all models that use 1/8 or more of the labeled data for 200 epochs, and models that see 1/16 and 1/32 of the original data for 400 and 800 epochs, respectively. In all cases, we apply the last model for testing.
We evaluate using two metrics: accuracy and edit distance. Accuracy reports the percentage of completely correct solutions, while the edit distance between the system's guess and the gold solution gives credit to systems that produce forms close to the correct one.
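Both metrics can be implemented in a few lines; the sketch below uses the standard Levenshtein distance (function and variable names are our own):

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def evaluate(guesses, golds):
    """Return (accuracy, mean edit distance) over paired predictions."""
    acc = sum(g == t for g, t in zip(guesses, golds)) / len(golds)
    dist = sum(edit_distance(g, t) for g, t in zip(guesses, golds)) / len(golds)
    return acc, dist
```

For example, a guess of "smiling" for the gold form "smiled" counts as wrong under accuracy but incurs an edit distance of only 3.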
Baselines. We compare our system to three baselines. The first one is MED, the winning system of the 2016 shared task. The network architecture is the same as in our system, but it is trained exclusively on labeled data; thus, we expect it to suffer more from a lack of resources. The second baseline is the official SIGMORPHON 2016 shared task baseline (SIG16) (Cotterell et al., 2016), which is similar in spirit to the system described by Nicolai et al. (2015). The system treats the prediction of edit operations to be performed on the input string as a sequential decision-making problem, greedily choosing each edit action given the previously chosen actions. The selection of operations is made by an averaged perceptron, using the binary features described in Cotterell et al. (2016).² Third, we compare to the baseline system of the CoNLL-SIGMORPHON 2017 shared task on universal morphological reinflection (SIG17) (Cotterell et al., 2017), which is well suited to low-resource settings. It splits all source and target forms in the training set into prefix, middle part and suffix, and uses those to find prefix or suffix substitution rules. For every evaluation example, the longest contained prefix or suffix is found and the rule belonging to that affix and the given target tag is applied to obtain the output.

Table 1: Accuracy (the higher the better) and edit distance (the lower the better) for our system and the three baselines on the official test set of task 3 of the SIGMORPHON 2016 shared task. Only the indicated amount (row labels) of the original training data is used, emulating a low-resource setting. Best results for each language in bold.

Results and discussion. As Table 1 shows, additionally training on unlabeled examples improves the performance of the encoder-decoder network for nearly all settings and languages, especially for the very low-resource scenarios with 1/16 and 1/32 of the training data. The biggest increases in accuracy can be seen for Russian and Spanish, both in the 1/32 setting, with 0.0963 (0.5023 − 0.4060) and 0.0992 (0.7564 − 0.6572), respectively. For the settings with larger amounts of training data available, the unlabeled data does not change performance much. This was expected, as the model already gets enough information from the annotated data. However, semi-supervised training never hurts performance, and can thus always be employed. Overall, our semi-supervised training method proves to be a useful extension of the original system. Furthermore, there is only one case (Georgian, 1/16) where any of the SIGMORPHON baselines outperforms the neural methods. This clearly shows the superiority of neural networks for the task and emphasizes the need to reduce the amount of labeled training data required for their training.

² Note that our use of the system differs from the official baseline in that we perform a direct form-to-form mapping. The shared task system first predicts form-to-lemma and then lemma-to-form. However, we assume no lemmata to be given, and thus are unable to train such a system.

Amount of Unlabeled Data
We now consider the amount of unlabeled examples as a function of the number of annotated examples. Data and training regime are the same as in Section 3. This analysis is performed on the development set, and we report the highest accuracy obtained during training. The resulting accuracies for Arabic and German can be seen in Figure 2; the other languages behave similarly to German. The loss of performance caused by reducing the training data varies a lot between languages, depending on how regular and thus "easy to learn" they are. Concerning the amount of unlabeled examples, even though in single cases other ratios are slightly better, using 4 times as many unlabeled as labeled examples mostly obtains the highest accuracy. Thus, a general rule could be that the more additional examples are used, the better. The only exception is Arabic in the 1/32 setting, where using half as many unlabeled as labeled examples obtains much better results. We attribute this to Arabic, a Semitic language, being templatic: since words in Arabic paradigms do not share a connected stem, we expect that giving the model too strong a bias to copy might harm performance in low-resource settings. However, even for low-resource Arabic, using a ratio of 1:4 of labeled to unlabeled examples still yields better performance than not using unlabeled examples at all. Thus, we conclude that if aiming for a language-independent setup, this is a good ratio.

Autoencoding of Random Strings
We expect the network to benefit from a bias to copy strings. This suggests that any random combination of characters from the language's alphabet could be autoencoded in order to improve performance in low-resource settings. To verify this, we train models on new datasets with 1/32 of the labeled examples from task 3 of the SIGMORPHON 2016 shared task and the optimal number of unlabeled examples for each language, cf. Section 4.1. However, the unlabeled examples are now random strings of a length between 3 and 20. All models are trained as before. Accuracies on the official test sets are shown in Table 2, compared to (i) training without unlabeled examples and (ii) the data being enhanced by corpus words.

Table 2: Accuracies for MED (Kann and Schütze, 2016), MED+corpus and MED+random. Descriptions in the text.

Several aspects of the results are noteworthy. First, for Arabic, the gap to the performance with corpus words is the biggest, showing that the tendency of languages to copy the stem when inflecting indeed plays an important role. Second, for some languages the performance gains for corpus words and random words are comparable. Third, the performance of random strings is closer to the performance of corpus words the higher the overall accuracy is. The additional unlabeled examples might be acting as regularizers in this case.
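Generating such random pseudo-words is straightforward; the sketch below follows the stated length range of 3 to 20, with the function name and seeding being our own choices:

```python
import random

def random_strings(alphabet, n, min_len=3, max_len=20, seed=0):
    """Generate n random strings over the language's alphabet,
    used as autoencoding training data in place of corpus words."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        length = rng.randint(min_len, max_len)
        out.append("".join(rng.choice(alphabet) for _ in range(length)))
    return out
```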
Overall, this experiment shows clearly that giving the model a bias to copy strings helps for inflection in non-templatic languages, and that random strings can improve a network for MRI.

Related Work
For the SIGMORPHON 2016 and the CoNLL-SIGMORPHON 2017 shared tasks (Cotterell et al., 2016, 2017), multiple MRI systems were developed, e.g., (Nicolai et al., 2016; Taji et al., 2016; Kann and Schütze, 2016; Aharoni et al., 2016; Östling, 2016; Makarov et al., 2017). Encoder-decoder neural networks (Cho et al., 2014a; Sutskever et al., 2014; Bahdanau et al., 2015) performed best, which is why we extend them in this work. Earlier work on paradigm completion includes (Faruqui et al., 2016; Nicolai et al., 2015; Durrett and DeNero, 2013). Work directly tackling MRI was rarer, e.g., (Dreyer and Eisner, 2009). Our work relates to the line of research on minimally supervised and unsupervised methods for morphology, e.g., Creutz and Lagus (2007) and Goldsmith (2001), who presented the unsupervised morphological segmentation systems Morfessor and Linguistica, or (Dreyer and Eisner, 2011; Poon et al., 2009; Snyder and Barzilay, 2008). However, none of those focused directly on MRI or on training neural networks for morphology. The only such case we know of is the work by Kann et al. (2017), who leveraged morphologically annotated data in a closely related high-resource language to reduce the need for labeled data in the target language. This works well for similar languages, but has the shortcoming of requiring annotations in such a language to be at hand. A similar approach was presented by Ha et al. (2016) for machine translation (MT).
Unlabeled corpora were used for semi-supervised training of models for MT, e.g., by Cheng et al. (2016); Vincent et al. (2010); Socher et al. (2011); Ramachandran et al. (2016). Those approaches differ from ours due to a fundamental difference between the two tasks: for MRI, the source vocabulary and the target vocabulary are mostly the same. This makes it intuitive for MRI to train the final model jointly on MRI and autoencoding.

Conclusion
We presented a method for semi-supervised training of a state-of-the-art model for low-resource MRI, using words from an unlabeled corpus. We found that the best ratio of labeled to unlabeled data depends on the morphological typology of the language. Finally, we showed that autoencoding random strings also increases performance, for some languages as much as using corpus words.