Experiments on Morphological Reinflection: CoNLL-2018 Shared Task

We present a system for the task of morphological inflection, i.e., finding a target morphological form given a lemma and a set of target tags. The system is trained on datasets of three sizes: low, medium, and high. It uses a simple Long Short-Term Memory (LSTM) based encoder-decoder model. Performance on the low-sized datasets is generally poor, while it improves significantly for the medium- and high-sized training datasets. Averaged over all languages, performance is worse than the baseline on the low datasets, comparable on the medium datasets, and significantly better on the high datasets.


Introduction
The CoNLL-SIGMORPHON 2018 shared task consists of two subtasks, of which we participate only in the first, which involves generating a target inflected form from a given lemma with its morphosyntactic descriptions (MSDs) provided as a set of features. For instance, the word thinking is the present participle of the lemma think. The models were trained on three differently-sized datasets: the low-sized datasets had around 100 training samples, the medium-sized datasets around 1,000, and the high-sized datasets around 10,000 for most languages. Datasets were provided for a total of 103 languages, including surprise data.
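A training sample for the task can be read with a few lines of Python. The sketch below assumes the shared task's tab-separated format (lemma, inflected form, semicolon-separated tags); the tag string in the example is illustrative, not copied from the released data.

```python
def parse_sample(line):
    # One training line: lemma <TAB> inflected form <TAB> tag1;tag2;...
    lemma, target, tags = line.rstrip("\n").split("\t")
    return lemma, target, tags.split(";")

print(parse_sample("think\tthinking\tV;V.PTCP;PRS"))
# ('think', 'thinking', ['V', 'V.PTCP', 'PRS'])
```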

Background
Prior to neural network based approaches to morphological reinflection, most systems used a three-step approach to solve the problem: 1) string alignment between the lemma and the target (morphologically transformed) form, 2) rule extraction from spans of the aligned strings, and 3) rule application to previously unseen lemmas to transform them. Durrett and DeNero (2013) and Ahlberg et al. (2014, 2015) used this approach, each with a different string alignment algorithm and a different model for extracting rules from the alignment tables. However, in these kinds of systems, the types of rules to be generated must be specified, and must also be engineered to take into account language-specific transformational behaviour. Faruqui et al. (2016) proposed a neural network based system that abstracts away these steps by modeling the problem as generating a character sequence, character by character. Kann and Schütze (2016) proposed a highly competitive implementation in the previous years' tasks (Cotterell et al., 2016, 2017).
Akin to machine translation systems, this system uses an encoder-decoder model built from LSTMs (Hochreiter and Schmidhuber, 1997). The encoder is a bidirectional LSTM, while the decoder LSTM feeds into a softmax layer at every character position of the target string. The decoder predicts the output sequence character by character, feeding each prediction back in until a stop symbol is predicted. The model exploits the fact that the target and the root word are similar except for the parts changed by inflection, by feeding the root word directly to the decoder as well. A separate neural network is trained for every language.
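The feedback loop in the decoder can be sketched as follows. Here `next_char` is a stand-in for the trained decoder LSTM plus softmax and is assumed purely for illustration:

```python
def greedy_decode(next_char, stop="</s>", max_len=50):
    # next_char(prefix) returns the most probable next character given the
    # characters generated so far; each prediction is fed back into the
    # decoder until the stop symbol is produced.
    out = []
    for _ in range(max_len):
        c = next_char(out)
        if c == stop:
            break
        out.append(c)
    return "".join(out)

# Toy stand-in that deterministically spells out one target form:
target = list("thinking") + ["</s>"]
print(greedy_decode(lambda prefix: target[len(prefix)]))  # thinking
```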

System Description
We have modelled our system on the one proposed by Faruqui et al. (2016), as described in the previous section. However, we have made some modifications to account for the three different dataset sizes and for the morphological behaviour of individual languages.
Some structural features and hyperparameters of the model remain the same. The characters of the root word and the morphological features of the target word are represented as one-hot vectors. The major change in our model is that the size of the LSTM layers is variable (depending on vocabulary size) rather than fixed as in the system proposed by Faruqui et al. (2016), based on the assumption that a bigger vocabulary requires bigger layers to extract features; the system is also trained for more epochs.
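A minimal sketch of the one-hot representation with fixed-length padding follows; the per-language alphabet extended with a dedicated padding symbol is an assumption for illustration, not the authors' exact encoding:

```python
def one_hot_word(word, alphabet, max_len, pad="0"):
    # One one-hot row per character position; the word is padded with a
    # dedicated pad symbol up to a fixed length so all inputs share a shape.
    symbols = list(alphabet) + [pad]
    index = {c: i for i, c in enumerate(symbols)}
    padded = word + pad * (max_len - len(word))
    return [[int(index[c] == j) for j in range(len(symbols))] for c in padded]

vecs = one_hot_word("ab", "abc", max_len=4)
# 4 positions, each a vector over {a, b, c, pad}
```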
The embedding size for each language therefore differs, depending on the alphabet of that language as observed in the given dataset, and similarly for the morphological tags, which are split into their individual components. We use a bidirectional encoder, to which we feed the input word embeddings. The output of the encoder, concatenated with the root word embedding and the morphological features, feeds into the decoder. All recurrent units have variable hidden layer dimensions, depending on the embedding size of the root word and morphological features. On top of the decoder is a softmax layer that predicts the character at each position of the target word. To maintain a constant word length, we pad words with a 0 character. All models use categorical cross-entropy as the loss function and the RMSProp optimizer.
The model was trained for 100 epochs for each dataset size. The Keras API (Chollet et al., 2015) was used to implement the neural networks. A batch size of 10 was used for the low datasets, 100 for the medium datasets, and 250 or 500 for the high datasets, depending on hardware limitations.

Submission
The following tables show the top 5 accuracies obtained by our system on the test data, compared against the baseline model.


Evaluation

Results on Test Set
The evaluation results were obtained using the evaluation script and the test set provided by the shared task organizers.
The best five baseline accuracies, accuracies for the first submission and accuracies for the second submission can be found in Table 1, Table 2 and Table 3 for each of the three dataset sizes: low, medium and high respectively.
The complete set of accuracies and Levenshtein distances for all languages is included in the Appendix (Tables 4 to 6).

Observations
We performed experiments in which the choice of hyperparameters was guided by intuitions developed from analysis of the dataset and from results obtained on smaller subsets of the data. We present some key observations from this analysis in the following sub-sections.

Number of layers
We observed that increasing the number of layers does not significantly improve performance, and even reduces it in some cases, while significantly increasing computation time. Rather than adding more layers, adding more complexity and features within the existing layers is more likely to improve performance.

Embedding of Morphological features
Multiple embedding types for representing the morphological features were tried, including binary vectors, one-hot vectors, and integer vectors. One-hot vectors gave the best performance for our model.

Size of encoder layer
Increasing the size of the encoder beyond a certain multiple of the total embedding size (∼5) results in a saturation of performance.
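This observation suggests a simple sizing rule. The function below is a hypothetical illustration of tying the hidden size to the total one-hot embedding width, not the authors' exact computation:

```python
def encoder_hidden_size(alphabet_size, num_tag_values, multiplier=5):
    # Hidden dimension tied to the total one-hot embedding width
    # (characters + morphological tag values); growing the encoder beyond
    # roughly 5x the embedding size showed no further benefit.
    return multiplier * (alphabet_size + num_tag_values)

print(encoder_hidden_size(30, 12))  # 210
```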

Hyperparameter Optimization
Various hyperparameters, such as batch size, dropout rate, and number of epochs, need to be optimized to obtain the best performance, and their optimal values may differ for each language.
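Such a per-language sweep can be sketched with a plain grid search; `evaluate` is a hypothetical function returning development-set accuracy for a given setting:

```python
from itertools import product

def grid_search(evaluate, grid):
    # Tries every combination in `grid` and keeps the setting with the
    # highest score returned by `evaluate`.
    best, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid, values))
        score = evaluate(params)
        if score > best_score:
            best, best_score = params, score
    return best, best_score

# Toy objective: pretends smaller batches and more epochs help.
grid = {"batch_size": [10, 100], "dropout": [0.0, 0.3], "epochs": [50, 100]}
best, score = grid_search(lambda p: p["epochs"] / p["batch_size"], grid)
print(best)  # {'batch_size': 10, 'dropout': 0.0, 'epochs': 100}
```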

Conclusions
There are two main conclusions. One is that different configurations of deep neural networks work well for different languages. The other is that deep learning may not be the right approach for low-sized data, or that additional pre-processing and post-processing may be needed to increase performance. Data augmentation is one alternative for dealing with low-resource languages.

A Appendix
In Tables 4 to 6 (on this page and the next), BA stands for baseline accuracy, L.D. for Levenshtein Distance, Acc for Accuracy, dev for development data.

Figure 1: C1, ..., Cn represent the characters of the root word, while O1, ..., On represent the characters of the output word.

Table 1: Top 5 accuracies for languages for low data.

Table 2: Top 5 accuracies for languages for medium data.

Table 3: Top 5 accuracies for languages for high data.

Table 4: Results for all languages for low data.

Table 5: Results for all languages for medium data.

Table 6: Results for all languages for high data.