SU-RUG at the CoNLL-SIGMORPHON 2017 shared task: Morphological Inflection with Attentional Sequence-to-Sequence Models

This paper describes the Stockholm University/University of Groningen (SU-RUG) system for the SIGMORPHON 2017 shared task on morphological inflection. Our system is based on an attentional sequence-to-sequence neural network model using Long Short-Term Memory (LSTM) cells, with joint training of morphological inflection and the inverse transformation, i.e. lemmatization and morphological analysis. Our system outperforms the baseline by a large margin, and our submission ranks as the 4th best team in the track we participate in (task 1, high-resource).


Introduction
We focus on task 1 of the SIGMORPHON 2017 shared task (Cotterell et al., 2017), morphological inflection. The task is to learn the mapping from a lemma and morphological description to the corresponding inflected form. For instance, the English verb lemma torment with the features 3.SG.PRS should be mapped to torments. As our model is poorly suited for low-resource conditions, we only submitted results for the 51 languages with high-resource training data available in the shared task (i.e., excluding Scottish Gaelic).
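To make the input/output format concrete, each training instance pairs a lemma and a feature bundle with the inflected form. The exact layout below (tab-separated fields, semicolon-delimited features) is our illustration of the UniMorph-style data, not a verbatim excerpt from the shared task release:

```python
# Illustrative training line: lemma, inflected form, feature bundle.
line = "torment\ttorments\tV;3;SG;PRS"

lemma, target, features = line.rstrip("\n").split("\t")
feature_bundle = features.split(";")

# The model must learn the mapping f(lemma, features) -> target.
```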

Background
The results of the SIGMORPHON 2016 shared task (Cotterell et al., 2016) indicated that the attentional sequence-to-sequence model of Bahdanau et al. (2014) is very suitable for this task (Kann and Schütze, 2016), so we use this framework as the basis of our model.
A recent trend in neural machine translation is to use back-translated text (Sennrich et al., 2016) as a way to benefit from additional monolingual data in the target language. There is also work on translation models with a reconstruction loss, which encourages solutions that can be translated back to their original source (Tu et al., 2016). These developments are technically similar to our semi-supervised training below.

Method
Our system is based on the attentional sequence-to-sequence model of Bahdanau et al. (2014) with Long Short-Term Memory (LSTM) cells (Hochreiter and Schmidhuber, 1997) and variational dropout (Gal and Ghahramani, 2016). The main innovation is that our inflection model is trained jointly with the reverse process, that is, lemmatization and morphological analysis. This can be done in two ways:
1. Fully supervised, where we simply train the forward (inflection) and backward (lemmatization and morphological analysis) models jointly with shared character embeddings.
2. Semi-supervised, where supervised examples are mixed with examples where only the inflected target form is used. This form is first passed through the backward model, using greedy search to obtain a single lemma, and then through the forward model to reconstruct the inflected form.
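The semi-supervised round trip can be sketched with toy stand-ins for the two networks; `backward_model` and `forward_model` below are hypothetical rule-based placeholders, not our actual LSTM models:

```python
def backward_model(inflected: str):
    # Toy stand-in for the lemmatizer/analyzer: in the real system this is
    # a greedy decode of the backward seq2seq model, yielding one lemma
    # and a predicted feature bundle.
    if inflected.endswith("s"):
        return inflected[:-1], ["V", "3", "SG", "PRS"]
    return inflected, ["V", "NFIN"]

def forward_model(lemma: str, features) -> str:
    # Toy stand-in for the inflection model.
    if features == ["V", "3", "SG", "PRS"]:
        return lemma + "s"
    return lemma

def round_trip(inflected: str) -> str:
    # Semi-supervised step: only the inflected form is observed; the
    # backward model proposes a lemma and analysis, and the forward model
    # is trained to reconstruct the original form from them.
    lemma, features = backward_model(inflected)
    return forward_model(lemma, features)
```

In training, the reconstruction of the inflected form provides a learning signal even when no gold lemma is available for that form.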
Our official submission only includes results from fully supervised training (method 1), due to time constraints, but Section 5 contains a comparison between the two versions on the development set.
The system architecture is shown in Figure 1 for the forward (inflection) model. The backward (lemmatizer) model has separate parameters, except for the embeddings, and is structurally identical except for two details: instead of passing the morphological feature information to the decoder (via a single fully connected layer), we predict the features from the final state of the encoder LSTM (via a separate fully connected layer).

Model configuration
For the official submission, we use 128 LSTM cells for the (unidirectional) encoder, decoder, attention mechanism, and character embeddings, as well as for the fully connected layers for morphological feature encoding/prediction. We use a dropout factor of 0.5 throughout the network, including the recurrent parts. For optimization, we use Adam (Kingma and Ba, 2015) with default parameters. Each model is trained for 48 hours on a single CPU, using a batch size of 64, and we save the model parameters that achieve the lowest mean Levenshtein distance on the development set during this time. For the official submission, we used an ensemble of two such models, using a beam search of width 10 to select the final inflection candidate.
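Checkpoint selection by development-set edit distance can be sketched as follows; the checkpoint records are hypothetical, but the Levenshtein recurrence is the standard one:

```python
def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def mean_dev_distance(predictions, references) -> float:
    return sum(levenshtein(p, r)
               for p, r in zip(predictions, references)) / len(references)

# During the 48-hour run, keep the parameters with the lowest mean
# development-set Levenshtein distance (hypothetical checkpoint records):
checkpoints = [
    {"step": 1000, "mean_dev_lev": 0.42},
    {"step": 2000, "mean_dev_lev": 0.17},
    {"step": 3000, "mean_dev_lev": 0.21},
]
best = min(checkpoints, key=lambda c: c["mean_dev_lev"])
```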

Results and Analysis
The system has high performance in general, with a macro-averaged accuracy of 93.6% and an edit distance of 0.14. This is substantially higher than the baseline (77.8% accuracy and 0.5 edit distance), and ranks as the 9th best run and 4th best team in this SIGMORPHON 2017 shared task setting. Furthermore, the difference in scores between our run and the best run overall is small (1.75% accuracy and 0.04 edit distance). Table 1 contains a detailed version of the official results of our system on the shared task, in the high setting of task 1. Notably, the system has an accuracy of 100% on both Basque and Quechua, which indicates that it is capable of fully learning the rules of very regular morphological systems. The relatively high accuracy on Semitic languages (Arabic: 89.8%, Hebrew: 99.0%) again confirms the ability of encoder-decoder models to also handle non-concatenative morphology.
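The headline numbers are macro-averages over the 51 languages, i.e. an unweighted mean of per-language scores; a sketch with hypothetical per-language results:

```python
# Hypothetical per-language (accuracy, mean edit distance) pairs; the real
# evaluation averages over all 51 high-resource languages.
per_language = {
    "english": (0.965, 0.05),
    "latin":   (0.470, 1.60),
    "quechua": (1.000, 0.00),
}

macro_accuracy = sum(acc for acc, _ in per_language.values()) / len(per_language)
macro_distance = sum(dist for _, dist in per_language.values()) / len(per_language)
```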
Latin has the lowest accuracy by far, and the reason seems to be that the provided shared task data lacks vowel length distinctions in the lemma but uses them in the inflected forms. This missing lexical information is difficult to predict accurately. Evaluating with vowel length distinctions gives an accuracy of 75.6% (Latin development set), compared to 91.5% without. The latter accuracy score is in line with other Romance languages (French 90.8%, Spanish 94.3%, Italian 97.0%).
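The two Latin evaluation conditions differ only in whether vowel length (marked with macrons) is kept; for the length-insensitive evaluation the macrons can be stripped via Unicode decomposition. This is our illustration, not necessarily the shared task's own normalization:

```python
import unicodedata

def strip_vowel_length(s: str) -> str:
    # Decompose (NFD) so a macron becomes a separate combining mark
    # (U+0304), drop it, then recompose (NFC).
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(ch for ch in decomposed if ch != "\u0304")
    return unicodedata.normalize("NFC", stripped)

# e.g. "am\u0101re" (amāre) -> "amare"
```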
We also investigated whether the semi-supervised approach described in Section 3 has any effect on accuracy. The results on the development set, presented in Table 2, indicate that there is no systematic effect (the macro-averaged accuracy drops marginally from 93.9% to 93.8%).

Conclusions
We implemented a system using an attentional sequence-to-sequence model with Long Short-Term Memory (LSTM) cells. As our model is poorly suited for low-resource conditions, we only participated in the high-resource setting. Our inflection model is trained jointly with the reverse process, that is, lemmatization and morphological analysis. The system significantly outperforms the baseline system, and performs well compared to other submitted systems, showing that this approach is very suitable for morphological inflection, given sufficient amounts of data.

Figure 1: System architecture, consisting of an attentional sequence-to-sequence model with LSTMs.


Table 1: Our system's official results on the SIGMORPHON 2017 shared task 1 test set, in the high setting.

Table 2: Our system's results on the SIGMORPHON 2017 shared task 1 development set, comparing fully supervised training (Full) to our semi-supervised method (Semi).