Morphological reinflection with convolutional neural networks

We present a system for morphological reinflection based on an encoder-decoder neural network model with extra convolutional layers. In spite of its simplicity, the method performs reasonably well on all the languages of the SIGMORPHON 2016 shared task, particularly for the most challenging problem of limited-resources reinflection (track 2, task 3). We also find that using only convolution achieves surprisingly good results in this task, surpassing the accuracy of our encoder-decoder model for several languages.


Introduction
Morphological reinflection is the task of predicting one form from a morphological paradigm given another form, e.g. predicting the English present participle ringing given the past tense rang. The SIGMORPHON shared task considers three variants of this problem, with decreasing amounts of information available beyond the source form and the morphological features of the target form:
1. The source form is always the citation form.
2. The source form's morphological features are not fixed, but given.
3. Only the source form itself is given.
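To make the three variants concrete, the following minimal sketch represents a reinflection instance and determines which variant it belongs to. The class name, field names and feature labels are hypothetical illustrations and do not reflect the shared task's file format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReinflectionInstance:
    """One reinflection example (hypothetical field names)."""
    source_form: str
    target_features: frozenset                   # given in all three variants
    source_features: Optional[frozenset] = None  # given only in variant 2


def task_variant(instance: ReinflectionInstance, source_is_citation_form: bool) -> int:
    """Return which of the three variants (1, 2 or 3) an instance belongs to."""
    if source_is_citation_form:
        return 1  # the source form is always the citation form
    if instance.source_features is not None:
        return 2  # source features vary between instances, but are given
    return 3      # only the source form itself is given


# Predict "ringing" from "rang"; the feature labels are illustrative only.
example = ReinflectionInstance(
    source_form="rang",
    target_features=frozenset({"V", "PTCP", "PRS"}),
)
print(task_variant(example, source_is_citation_form=False))
```

In all three variants the target form's features are part of the input, so only the information about the source form differs.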
The first and simplest case is the most well-researched, and is essentially equivalent to the task of predicting morphological paradigms. This paper presents our system for morphological reinflection, which was submitted for the SIGMORPHON 2016 shared task. To complement the description given here, the source code of our implementation is available as free software.

Background
In general, morphological reinflection can be solved by applying any technique for morphological analysis followed by morphological generation. These tasks have traditionally been performed using manually specified rules, a slow and expensive process. Recently, there has been an increased interest in methods for learning morphological transformations automatically from data, which is also the setting of the SIGMORPHON 2016 shared task.
This work is based on that of Faruqui et al. (2016), who use a sequence-to-sequence model similar to that commonly used in machine translation. Their method is very simple: for each language and morphological feature set, they train a separate model with a character-level bidirectional LSTM encoder (where only the final hidden states are used), and an LSTM decoder whose inputs are the encoded input as well as the input character sequence.

Model
We propose modifying the model of Faruqui et al. (2016) by:
1. using a single decoder, rather than one for each combination of morphological features (which could lead to data sparsity for languages with complex morphology and large paradigms),
2. using both the raw letter sequence of the source string and its convolution as inputs,
3. using deeper LSTM units for the decoder.
Although this model was originally designed for inflection generation given a lemma, it can trivially be used for reinflection by using inflected forms rather than lemmas as input. Thus, we use exactly the same model for the first and third tasks, and for the second task, where morphological features are given for the source form, we include those features along with the target form features (which are given in all three tasks).
In our experiments, we use 4 convolutional layers and 2 stacked LSTMs (Hochreiter and Schmidhuber, 1997). We use 256 LSTM units (for both the encoder and decoder), 64-dimensional character embeddings and 64 convolutional filters of width 3 for each layer. The LSTM outputs are projected through a fully connected hidden layer with 64 units, and finally through a fully connected layer with softmax activations over the alphabet of the language in question. Morphological features are encoded as binary vectors, which are concatenated with the character embeddings (and, when used, convolved character embeddings) to form the input of the decoder. We use the Adam algorithm (Kingma and Ba, 2014) for optimization, where the training objective is the cross-entropy of the target strings. For decoding, we use beam search with a beam size of 4. The model architecture is summarized in figure 1.
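The binary feature encoding described above can be sketched as follows. This is a minimal illustration in plain Python rather than our implementation: the feature inventory and function names are hypothetical, and repeating the same feature vector at every character position is an assumption about how the concatenation is carried out.

```python
from typing import List, Sequence


def encode_features(active: Sequence[str], inventory: Sequence[str]) -> List[float]:
    """Binary vector with 1.0 for every inventory feature that is active."""
    active_set = set(active)
    return [1.0 if feature in active_set else 0.0 for feature in inventory]


def decoder_inputs(char_embeddings: List[List[float]],
                   feature_vector: List[float]) -> List[List[float]]:
    """Concatenate the feature vector to the embedding at each position
    (assumption: the same vector is repeated for every character)."""
    return [embedding + feature_vector for embedding in char_embeddings]


# Hypothetical feature inventory for a toy language.
inventory = ["pos=V", "tense=PST", "tense=PRS", "num=SG", "num=PL"]
features = encode_features(["pos=V", "tense=PRS"], inventory)

# Dummy 64-dimensional embeddings, one per character of the source form.
embeddings = [[0.0] * 64 for _ in "rang"]
inputs = decoder_inputs(embeddings, features)
print(len(inputs), len(inputs[0]))
```

With this inventory of five features, each decoder input vector has 64 + 5 dimensions.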
To further explore the effect of using convolutional layers in isolation, we also performed follow-up experiments after the shared task submission. In these experiments we used an even simpler architecture without any encoder; instead, we used a 1-dimensional residual network architecture (He et al., 2016) [...] size across layers, followed by either one or zero Gated Recurrent Unit layers (Cho et al., 2014). The output vector of each residual layer (which contains two convolutional layers, each followed by Batch Normalization (Ioffe and Szegedy, 2015) and rectified linear units) is combined with the vector of the previous layer by addition, so that the output of each layer is the sum of its input and the result of its convolutions. This direct additive coupling between layers at different depths allows very deep networks to be trained efficiently. In this work we use up to 12 residual layers, corresponding to a total of 24 convolutional layers.
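The additive coupling of a residual layer can be illustrated with the following sketch in plain Python. It is a simplified illustration under stated assumptions, not our implementation: Batch Normalization is omitted, the convolution uses zero padding so that the sequence length is preserved, and all function names are hypothetical.

```python
from typing import List

Sequence2D = List[List[float]]  # one vector of channels per character position


def conv1d_same(x: Sequence2D, kernel) -> Sequence2D:
    """1-D convolution with zero padding; kernel[k][i][o] maps input channel i
    at offset k to output channel o."""
    width, n_in, n_out = len(kernel), len(kernel[0]), len(kernel[0][0])
    pad = width // 2
    zero = [0.0] * n_in
    padded = [zero] * pad + x + [zero] * pad
    return [
        [sum(padded[t + k][i] * kernel[k][i][o]
             for k in range(width) for i in range(n_in))
         for o in range(n_out)]
        for t in range(len(x))
    ]


def relu(x: Sequence2D) -> Sequence2D:
    return [[max(0.0, v) for v in row] for row in x]


def residual_layer(x: Sequence2D, kernel_a, kernel_b) -> Sequence2D:
    """Two convolutions with a ReLU after each (Batch Normalization omitted),
    whose result is added to the layer's input."""
    h = relu(conv1d_same(x, kernel_a))
    h = relu(conv1d_same(h, kernel_b))
    return [[xi + hi for xi, hi in zip(x_row, h_row)] for x_row, h_row in zip(x, h)]


def zero_kernel(width: int, channels: int):
    return [[[0.0] * channels for _ in range(channels)] for _ in range(width)]


# With all-zero kernels every residual layer reduces to the identity, which
# shows how the input is carried unchanged through a deep stack of layers.
x = [[1.0, -2.0], [0.5, 3.0], [0.0, 1.0]]
layers = [(zero_kernel(3, 2), zero_kernel(3, 2)) for _ in range(12)]
y = x
for kernel_a, kernel_b in layers:
    y = residual_layer(y, kernel_a, kernel_b)
num_conv_layers = 2 * len(layers)  # 12 residual layers, two convolutions each
print(y == x, num_conv_layers)
```

Because each layer only adds a correction to its input, the input and output must have the same number of channels, which is one reason for keeping the layer size uniform.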
In these experiments (unlike the encoder-decoder model), dropout (Srivastava et al., 2014) was used for regularization, with a dropout factor of 50%. The morphological features of the target form are concatenated to the 128-dimensional character embeddings at the top convolutional layer. To keep the architecture simple and uniform, the total number of filters for each layer is therefore 128 + n, where n is the number of different morphological features in the given language. Decoding is done by choosing the single most probable symbol at each letter position, according to the final softmax layer. This model is summarized in figure 2.
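The per-position decoding and the filter count described above can be sketched as follows. This is a minimal illustration, not our implementation: the function names are hypothetical, and the use of an end-of-word symbol to terminate the output string is an assumption, since the handling of output length is not specified here.

```python
from typing import List, Sequence


def filters_per_layer(n_features: int, embedding_dim: int = 128) -> int:
    """Number of filters per layer: embedding size plus one per feature."""
    return embedding_dim + n_features


def greedy_decode(position_probs: Sequence[Sequence[float]],
                  alphabet: List[str],
                  end_symbol: str = "</s>") -> str:
    """Pick the most probable symbol independently at each letter position.

    Assumption: decoding stops at the first end_symbol.
    """
    symbols = []
    for probs in position_probs:
        best = max(range(len(alphabet)), key=lambda i: probs[i])
        if alphabet[best] == end_symbol:
            break
        symbols.append(alphabet[best])
    return "".join(symbols)


# Toy softmax outputs over a three-symbol alphabet for four positions.
alphabet = ["</s>", "a", "b"]
probs = [
    [0.1, 0.7, 0.2],
    [0.2, 0.1, 0.7],
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
]
print(greedy_decode(probs, alphabet), filters_per_layer(10))
```

Unlike the beam search used with the encoder-decoder model, this procedure makes one independent choice per position and keeps no alternative hypotheses.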

Evaluation
All results reported in this section refer to accuracy, computed using the official SIGMORPHON 2016 development data and scoring script. Table 1 shows the result on the official test set, and a full comparison to other systems is available on the shared task website (our system is labeled 'HEL'). We participate only in track 2, which only allows training data from the same task that is evaluated. Training data from other (lower-numbered) tasks, as track 1 allows, could trivially be appended to the training data of our model, but this was not done since we focused on exploring the core problem of learning reinflection. The same constraints are followed in all experiments described here.
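For reference, accuracy here can be read as the proportion of predicted forms that exactly match the reference form. The sketch below illustrates that reading; it is not the official scoring script, and the function name and example forms are hypothetical.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that are identical to their reference form."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Two of the three predicted forms match the reference exactly.
score = accuracy(["ringing", "sang", "walks"], ["ringing", "sung", "walks"])
print(f"{score:.3f}")
```

Under this measure a prediction that differs from the reference in a single character counts as an error, just like a completely wrong form.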
Note that due to time constraints, we were not able to explore the full set of parameters before submitting the test set results. Of the models that had finished training by the deadline, we chose the one which had the highest accuracy on the development set. The results reported here are from later experiments which were carried out to systematically test the effects of our proposed changes. Table 2 shows that using convolutional layers improves accuracy in almost all cases, whereas adding an extra LSTM layer does not bring any systematic improvement.
Results when using only convolutional layers, or convolutional layers followed by a GRU recurrent layer, can be found in table 3. To our surprise, we found that convolution alone is sufficient to achieve results comparable to or better than several of the other systems in the shared task, and for some languages it beats our own submitted results. There is no clear benefit across languages of adding a final GRU decoder layer, but increasing the depth of the network and in particular the width of the convolution seems to benefit accuracy.
Table 2: Results of our convolutional encoder-decoder system on the official SIGMORPHON shared task development set for task 3 (reinflection). The first column contains results of models with both convolutions (4 layers) and deep LSTMs (2 layers), the second uses a single LSTM layer, and the third one uses no convolutional layers.

Conclusions
We find that the model of Faruqui et al. (2016) can be extended to the task of reinflection and delivers very good levels of accuracy across languages, and that adding convolutional layers consistently improves accuracy. Further experiments show, to our surprise, that a simple and purely convolutional architecture designed for image classification in many cases achieves an even higher accuracy. Although convolutional architectures have become standard (along with recurrent neural networks) in many text encoding tasks, this is one of rather few examples where they have been successfully used for text generation.