AX Semantics’ Submission to the SIGMORPHON 2019 Shared Task

This paper describes AX Semantics' submission to the SIGMORPHON 2019 shared task on morphological reinflection. We implemented two systems, both tackling the task for all languages in one codebase, without any underlying language-specific features. The first is an encoder-decoder model using AllenNLP; the second system uses the same model, modified by a custom trainer that trains only with the target-language resources after a specified threshold is reached. We especially focused on building an implementation using AllenNLP with out-of-the-box methods to facilitate easy operation and reuse.


Introduction
This paper describes our implementation and results for Task 1 of the 2019 Shared Task (McCarthy et al., 2019). The task is to generate inflected word forms given the lemma and a morphological feature specification (Kirov et al., 2018). See Figure 1 for an example in German, where a verb lemma is inflected according to the specified number, mood, tense and person. In contrast to last year, where the training data was only in the respective target language, this year the given data consists of up to 10,000 exemplars of one high-resource language combined with up to 100 exemplars of a low-resource language. The target language is the low-resource language. The task is to use the high-resource data to improve the inflection of the low-resource language. Including the surprise language pairs, the task comprises 99 language pairs.
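The training files of the shared task are tab-separated triples of lemma, inflected form, and semicolon-joined UniMorph feature bundle. As a minimal illustration (the helper name `parse_exemplar` is ours, not part of the task tooling), one such line can be split into its fields like this:

```python
def parse_exemplar(line):
    """Split one tab-separated training line into its three fields:
    lemma, inflected form, and the list of UniMorph features."""
    lemma, form, features = line.rstrip("\n").split("\t")
    return lemma, form, features.split(";")

# German example: the verb lemma "geben" ("to give") inflected to
# first person singular, past tense, indicative mood.
lemma, form, feats = parse_exemplar("geben\tgab\tV;IND;PST;1;SG")
```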

Motivation
After participating last year (Madsack et al., 2018) we started to rebuild everything we needed for our production system using AllenNLP (Gardner et al., 2017). Our main goals here are reproducibility and full logging of everything by default. In our experience AllenNLP brings best practices that, while sometimes opinionated, are far better than building everything from scratch, and we wanted to apply these to this problem.
Our two systems represent our learning curve in the attempt to solve the given shared task. The first system is a solution entirely based on existing AllenNLP components. The second system has a custom trainer that initially trains on all given training data for a pair and then continues with only the (low-resource) target language data.
The source code of our submission can be found at: https://301.ax/github-sigmorphon2019


System 1 - softmax baseline in AllenNLP
Our first system is the soft-attention baseline rebuilt in AllenNLP. It primarily serves as a starting point for our second system. The model is an encoder-decoder (Cho et al., 2014) and uses the readily implemented version in AllenNLP (named SimpleSeq2Seq). We modified the model code to add accuracy and edit-distance metrics. The attention used is dot-product attention (Luong et al., 2015). All other hyperparameters are inspired by Wu et al. (2018) and shown in Table 1.
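As a rough illustration of how such a model is assembled declaratively in AllenNLP, a configuration fragment along the following lines could be used. This is a sketch, not our submitted configuration: the registered names follow AllenNLP's 2019-era API, the dimensions are placeholders rather than the values from Table 1, and exact keys may differ between AllenNLP versions.

```jsonnet
{
  "dataset_reader": { "type": "seq2seq" },      // tab-separated source/target pairs
  "model": {
    "type": "simple_seq2seq",                   // AllenNLP's SimpleSeq2Seq
    "source_embedder": {
      "tokens": { "type": "embedding", "embedding_dim": 100 }
    },
    "encoder": { "type": "lstm", "input_size": 100, "hidden_size": 200 },
    "attention": { "type": "dot_product" },     // Luong-style attention
    "max_decoding_steps": 50,
    "beam_size": 4
  },
  "trainer": { "optimizer": "adam", "num_epochs": 60, "patience": 5 }
}
```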
We trained two kinds of System 1: one with only the low-resource data as a baseline, and another with the high- and low-resource data concatenated. All systems used here are character based, and the input sequence is the lemma, followed by a next marker (we used a tabulator), followed by the morphological features as a string. One example encoder input looks like the following: zmrzlina N;DAT;SG. In addition, AllenNLP wraps inputs and target outputs with start and end markers.


System 2 - transfer learning
The second system uses the same encoder-decoder model as System 1. The major modification is a trainer that first learns on all training data (high- and low-resource data) and, after a threshold is reached, continues learning only with the target language. This threshold marks the transfer learning point: the cross-lingual model is reused as a basis for training with the monolingual data. The first 10 epochs are always trained with all training data. The switch to training only with low-resource data happens after 5 epochs without training improvement; as the metric for this improvement, a lower loss on validation data is used. Most hyperparameters (Table 1) for System 2 were halved after some experimental evaluation. We did not do an exhaustive search of these parameters, so minor improvements through hyperparameter optimization (e.g. cross-validated grid search) are possible here. Figure 3 shows a loss curve where training with only the target language Khakas (and without the high-resource language Bashkir) started at epoch 17. For comparison, the loss curve of System 1 for the same language pair is shown in Figure 2. In this example System 2 achieves a lower validation loss than System 1: at about 0.2, it reaches roughly half the loss of System 1 at about 0.4.
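The training schedule described above can be sketched as follows. The helper names `train_step` and `val_loss` are hypothetical stand-ins for one training epoch and one validation pass, not AllenNLP API; the sketch only shows the switching logic of the custom trainer.

```python
def train_with_transfer(train_step, val_loss, combined_data, low_data,
                        max_epochs=60, warmup_epochs=10, patience=5):
    """Train on combined high+low-resource data for at least `warmup_epochs`
    epochs, then switch to low-resource-only data once validation loss has
    not improved for `patience` consecutive epochs (the transfer point)."""
    data = combined_data
    best_loss = float("inf")
    stale = 0                 # epochs without validation-loss improvement
    switch_epoch = None
    for epoch in range(1, max_epochs + 1):
        train_step(data, epoch)          # one epoch of training on `data`
        loss = val_loss(epoch)           # loss on held-out low-resource data
        if loss < best_loss:
            best_loss, stale = loss, 0
        else:
            stale += 1
        # transfer point: continue with only the target (low-resource) language
        if switch_epoch is None and epoch >= warmup_epochs and stale >= patience:
            data = low_data
            switch_epoch = epoch + 1
            best_loss, stale = float("inf"), 0   # reset tracking after switch
    return switch_epoch
```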

Results
The results for System 1 and System 2, shown in Table 2 and Table 3 together with the soft-attention baseline from the organizers (Wu and Cotterell, 2019), are the unmodified results from the submission to the task. We found minor tooling mistakes for the surprise languages after the submission deadline, which we did not correct in the tables.
In the trained models we can observe large differences in accuracy across language pairs. To better understand the results, we trained a new version of System 1 with only the given low-resource data and ignored the high-resource language data completely. As expected, this version of System 1 performed worst of all systems due to the lack of a sufficient amount of training data.
In general, the very low results on some pairs seem to stem from very different character sets and/or feature sets within the language pairs concerned. For example, "bengali-greek" is a pair with largely disjoint character sets and feature sets. The amount of Greek data alone is not enough to train an encoder-decoder model (see the low-data System 1 results), and the Bengali data does not help either (see the System 1, System 2 and baseline results).
Thus, the results indicate that a difference in features and/or character sets has a big impact on the usefulness of the high-resource training data. For the character-set mismatch, a phonological mapping to a shared phonetic alphabet could alleviate the issue.

Conclusion
Our continual goal is to improve the morphology component in our Natural Language Generation SaaS (Weißgraeber and Madsack, 2017). In our production setup, the System 1 described above competes against a handcrafted morphology and a reasonable lexicon (which were not used for the shared task). This handcrafted morphology together with the lexicon is always better on very regular part-of-speech (POS) types (e.g. German adjectives). Therefore, the systems shown here are not used for every language-POS combination in our production NLG inflection system; for every language and POS type we evaluate which solution fits best.
AllenNLP successfully helped us to reproduce the same results even with newer versions of the underlying libraries (e.g. PyTorch, CUDA, Python), which is an important quality for our NLG system.