ISI at the SIGMORPHON 2017 Shared Task on Morphological Reinflection

We present a system for morphological reinflection based on an LSTM model. Given an input word and morphosyntactic descriptions, the problem is to classify the proper edit tree that, applied to the input word, produces the target form. The proposed method does not require human-defined features and is language independent. Currently, we evaluate our system only on task 1, without using any external data. On the test sets, the proposed model beats the baseline on 15 of the 52 languages in the high resource scenario, but its performance is poor when the training set size is medium or low.


Introduction
The morphological reinflection task is to generate an inflected variant of a source word, given the morphosyntactic descriptions of the target word. This year's shared task (Cotterell et al., 2017) is divided into two sub-tasks. Task 1 requires inflecting isolated word forms based on labelled training data. For example, given the source form 'communicate' and the features 'V;3;SG;PRS', one has to predict the target form 'communicates'. In task 2, by contrast, partially filled paradigms are provided, and the goal is to complete them using a restricted number of full paradigms. For each task, three separate training files of different sizes (low/medium/high) are given per language, in order to analyze the systems' generalization ability in low and high resource situations. The competition covers 52 languages. For each language, a finite set of morphological tags is provided, from which the target inflections are taken. Evaluation is done separately for each of the three training sets. To keep the competition fair, the use of external resources is forbidden in the main competition track. However, for systems that make use of external monolingual corpora, a list of approved corpora selected from the Wikipedia text dumps is provided.
So far, there have been several efforts on reinflection employing statistical learning based methods (Dreyer and Eisner, 2011; Durrett and DeNero, 2013; Ahlberg et al., 2015; King, 2016) and string transduction (Nicolai et al., 2015). These methods rely on feature definitions that are hard to generalize to all of the world's languages.
In this article, we introduce a long short-term memory (LSTM) network architecture to handle the morphological reinflection task. The proposed method is language independent and does not require features to be defined manually. Our model is related to encoder-decoder based approaches (Aharoni et al., 2016; Faruqui et al., 2016; Kann and Schütze, 2016a,b), but the main difference is that the proposed network is not designed to generate a sequence of characters as output. Rather, we formulate the problem as classifying the transformation required to convert a source form to its target form (Chakrabarty et al., 2017). Our goal is a system that receives an input word and the morphological tags and returns the transformation that yields the target word. The source-target transformation is represented as an edit tree (Chrupala et al., 2008; Müller et al., 2015). Initially, all edit trees are extracted from the labelled pairs in the training data, and the distinct trees among them are taken as the class labels. We feed the character sequence of the input word through the LSTM network to encode it; the encoded representation is then trained jointly with the input tags to classify the correct edit tree.
Currently, we assess our system only on task 1 for all 52 languages, though it can also be used for task 2. No external data, such as the Wikipedia dumps provided by the SIGMORPHON committee, has been used in the present work. The results obtained on the test sets indicate that the proposed method is resource intensive. When the training size is high, it outperforms the baseline system on 15 of the 52 languages. With medium and low amounts of training data, however, the performance is poor: it beats the baseline on only 5 and 4 languages respectively.

Methodology
Edit Trees: An edit tree encodes a transformation that maps a source string to a target string. Given a source-target pair, the corresponding edit tree is found as follows. First, the longest common substring (LCS) between the two strings is found; then the prefix pair and the suffix pair around the LCS are recursively modelled in the same manner. The edit tree does not encode the LCS itself. Instead, for the sake of generalization, each node stores the lengths of the prefix and the suffix of the LCS in the source string. When the source and target strings have no common substring, they are kept as a substitution node. Figure 1 shows the edit tree for the source-target pair 'sang-sing'. The LCS between them is 'ng'. In the source string, the prefix length of the LCS is 2 (for 'sa') and the suffix length is 0, so the root of the edit tree stores the pair (2, 0). The left subtree of the root is the edit tree for the prefix pair of the LCS in the source and the target string, i.e. for 'sa-si'. By the same procedure, the right subtree is empty, since both suffixes are empty.
Note that the LCS is not stored in the edit tree so that the transformation pattern generalizes. Consider the two source-target pairs 'gives-give' and 'takes-take', where the transformation rule is the same, i.e. to omit the final 's'. The root of the corresponding edit tree contains (0, 1). If the LCS were stored in the root, the tree could not be shared by other pairs such as 'comes-come' and 'sleeps-sleep', where the same rule applies.
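The construction described above can be illustrated with a short Python sketch. It is not the implementation used in our system: the tuple-based tree encoding, the function names and the tie-breaking rule (the first longest common substring found is used) are assumptions made for this illustration only.

```python
def longest_common_substring(src, tgt):
    """Return (start_in_src, start_in_tgt, length) of the first longest common substring."""
    best = (0, 0, 0)
    prev = [0] * (len(tgt) + 1)
    for i in range(1, len(src) + 1):
        cur = [0] * (len(tgt) + 1)
        for j in range(1, len(tgt) + 1):
            if src[i - 1] == tgt[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def build_edit_tree(src, tgt):
    """Build an edit tree; None is the empty tree, ('sub', s, t) a substitution node."""
    if not src and not tgt:
        return None
    i, j, n = longest_common_substring(src, tgt)
    if n == 0:
        return ("sub", src, tgt)
    left = build_edit_tree(src[:i], tgt[:j])
    right = build_edit_tree(src[i + n:], tgt[j + n:])
    # Store only the prefix and suffix lengths in the source, not the LCS itself.
    return ("match", i, len(src) - i - n, left, right)


def apply_edit_tree(tree, word):
    """Apply a tree to a word; return None when the tree is not applicable."""
    if tree is None:
        return "" if word == "" else None
    if tree[0] == "sub":
        _, src, tgt = tree
        return tgt if word == src else None
    _, prefix_len, suffix_len, left, right = tree
    if len(word) < prefix_len + suffix_len:
        return None
    split = len(word) - suffix_len
    head = apply_edit_tree(left, word[:prefix_len])
    tail = apply_edit_tree(right, word[split:])
    if head is None or tail is None:
        return None
    return head + word[prefix_len:split] + tail


sang_sing = build_edit_tree("sang", "sing")
drop_s = build_edit_tree("gives", "give")
print(sang_sing)
print(apply_edit_tree(drop_s, "comes"))
```

In this encoding the root of the 'sang-sing' tree carries the pair (2, 0), and the trees extracted from 'gives-give' and 'takes-take' compare equal, so they collapse into a single class label.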

The System Description
The architecture of our system is presented in Figure 2. First, we use the LSTM network to build a syntactic embedding of the source word that captures its morphological regularities. Let $C$ be the character alphabet of the language concerned. Let the input word $w$ consist of the character sequence $c_1, \ldots, c_m$, where $m$ is the word length and each character $c_i$ is represented as a one-hot encoded vector $1_{c_i}$. The dimension of $1_{c_i}$ corresponding to the index of $c_i$ in the alphabet $C$ is set to one and all other dimensions are set to zero. The vectors $1_{c_1}, \ldots, 1_{c_m}$ are passed to an embedding layer $E_c \in \mathbb{R}^{d_c \times |C|}$, which projects them to $d_c$-dimensional vectors $e_{c_1}, \ldots, e_{c_m}$ through the operation $e_{c_i} = E_c \cdot 1_{c_i}$, where '$\cdot$' denotes matrix multiplication.
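To make the embedding step concrete, the following pure-Python sketch computes $E_c \cdot 1_{c_i}$ for each character of a word. The toy alphabet, the dimension $d_c = 2$ and the matrix values are made up for illustration; in the actual model $E_c$ is learned. The sketch also shows that multiplying by a one-hot vector simply selects one column of $E_c$.

```python
def one_hot(index, size):
    """One-hot vector with a 1.0 at position `index`."""
    return [1.0 if k == index else 0.0 for k in range(size)]


def matvec(matrix, vector):
    """Plain matrix-vector product; `matrix` is a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def embed_word(word, alphabet, E):
    """Map each character c_i to e_{c_i} = E . 1_{c_i}."""
    index = {ch: k for k, ch in enumerate(alphabet)}
    return [matvec(E, one_hot(index[ch], len(alphabet))) for ch in word]


# Hypothetical toy setup: |C| = 5 characters, d_c = 2.
alphabet = ["a", "g", "i", "n", "s"]
E = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [-0.1, -0.2, -0.3, -0.4, -0.5],
]
embedded = embed_word("sang", alphabet, E)
print(embedded[0])  # the column of E that belongs to 's'
```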
When the sequence of vectors $e_{c_1}, \ldots, e_{c_m}$ is given to the LSTM network, it computes the state sequence $h_1, \ldots, h_m$ using the LSTM update equations, in which $\sigma$ denotes the sigmoid function and $\odot$ stands for the element-wise (Hadamard) product. The LSTM uses an extra memory $c_t$ that is controlled by three gates: input ($i_t$), forget ($f_t$) and output ($o_t$). The weights $W$, $U$, $V$ and the bias $b$ are the parameters. Finally, we take the last state $h_m$ as the encoded representation of $w$.
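For reference, one LSTM formulation that is consistent with the parameter families named above ($W$, $U$, $V$ and $b$) is the peephole variant sketched below. Assigning $V$ to the connections from the memory cell to the gates is an assumption, as are the gate subscripts on the parameters; here $c_t$ denotes the memory at step $t$, not a character.

```latex
\begin{align*}
i_t &= \sigma\left(W_i e_{c_t} + U_i h_{t-1} + V_i c_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_f e_{c_t} + U_f h_{t-1} + V_f c_{t-1} + b_f\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(W_c e_{c_t} + U_c h_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_o e_{c_t} + U_o h_{t-1} + V_o c_t + b_o\right) \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{align*}
```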
In addition to the source word, the morphosyntactic features are available for predicting the target form. All distinct features in the training data are collected into a feature dictionary $F$. For a training sample, the given features are mapped to a $|F|$-dimensional feature vector $f = (f_1, \ldots, f_{|F|})$, where $f_i = 1$ if the $i$-th feature in the dictionary is present in the input features and $f_i = 0$ otherwise. Thus, $f$ is a numeric representation of the input features of the training sample.
Another important point is that, for an arbitrary input word, not all unique edit trees in the training data are applicable, because of incompatible substitutions. For example, the edit tree for the source-target pair 'sang-sing' (shown in Figure 1) cannot be applied to the word 'sleep'. Although all unique edit trees are set as class labels, only a few of them are applicable to a given input word. To handle this, we supply the model with information about the classes over which it should distribute the output probability mass during training.
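The mapping from a tag string to the binary feature vector can be sketched as follows. The helper names, the alphabetical ordering of the dictionary and the decision to ignore features unseen in training are assumptions of this illustration; the three training tag strings are made-up examples in the 'V;3;SG;PRS' format of the shared task.

```python
def build_feature_dictionary(tag_strings):
    """Collect all distinct features and assign each one an index."""
    features = sorted({feat for tags in tag_strings for feat in tags.split(";")})
    return {feat: k for k, feat in enumerate(features)}


def encode_features(tags, feature_index):
    """Return the |F|-dimensional 0/1 vector for one tag string."""
    f = [0] * len(feature_index)
    for feat in tags.split(";"):
        if feat in feature_index:  # assumption: unseen features are skipped
            f[feature_index[feat]] = 1
    return f


training_tags = ["V;3;SG;PRS", "V;PST", "N;PL"]  # hypothetical samples
feature_index = build_feature_dictionary(training_tags)
print(feature_index)
print(encode_features("V;3;SG;PRS", feature_index))
```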
Let $T = \{t_1, \ldots, t_k\}$ be the set of distinct edit trees extracted from the training data. For the input word $w$, its applicable edit trees vector is defined as $a = (a_1, \ldots, a_k)$, where for all $j \in \{1, \ldots, k\}$, $a_j = 1$ if $t_j$ is applicable to $w$ and $a_j = 0$ otherwise. Hence, $a$ holds the applicable edit tree information for $w$. Finally, we combine the LSTM output $h_m$, the feature vector $f$ and the applicable tree vector $a$ for the edit tree classification task as $l = \mathrm{softplus}(L_h h_m + L_f f + L_a a + b)$, where 'softplus' is the activation function $f(x) = \ln(1 + e^x)$ and $L_h$, $L_f$, $L_a$ and $b$ are the network parameters. Next, $l$ is passed through the softmax layer to obtain the output distribution over the class labels for $w$.
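The classification layer can be sketched in plain Python as below, assuming that the three inputs are combined by summing their linear projections before the softplus activation. All numeric values are toy placeholders (hidden size 2, $|F| = 3$, $k = 2$), not trained parameters, and the function names are chosen for this illustration.

```python
import math


def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def matvec(matrix, vector):
    """Plain matrix-vector product; `matrix` is a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def classification_layer(h, f, a, L_h, L_f, L_a, b):
    """Compute softplus(L_h h + L_f f + L_a a + b) element-wise."""
    pre = [x + y + z + bias for x, y, z, bias in
           zip(matvec(L_h, h), matvec(L_f, f), matvec(L_a, a), b)]
    return [softplus(x) for x in pre]


def softmax(values):
    """Softmax with the usual max-shift for numerical stability."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Toy placeholders: 2 hidden units, 3 features, k = 2 edit trees.
h_m = [0.5, -0.5]
f = [1, 0, 1]
a = [1, 0]
L_h = [[0.2, 0.1], [-0.3, 0.4]]
L_f = [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1]]
L_a = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]

l = classification_layer(h_m, f, a, L_h, L_f, L_a, b)
o = softmax(l)
print(l, o)
```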
To pick the most probable edit tree for an input word, we exploit the prior information about the applicable classes. Let $o = (o_1, \ldots, o_k)$ be the output of the softmax layer. The edit tree $t_j \in T$ is taken as the right candidate, where $j = \arg\max_{j' \in \{1, \ldots, k\},\ a_{j'} = 1} o_{j'}$. In this way, we choose the most probable class among the applicable classes only.
... of LSTM is set to 64 for all languages. We apply online learning in our model. The number of epochs and the dropout rate are set to 150 and 0.2 respectively. We use the 'Adagrad' (Duchi et al., 2011) optimization algorithm for training. The categorical cross-entropy function is used to measure the loss in our model.
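The selection of the most probable edit tree among the applicable classes, as described above, amounts to an argmax restricted by the vector $a$. A minimal sketch follows; the function name and the choice to return None when no tree is applicable are assumptions of this illustration.

```python
def predict_edit_tree(o, a):
    """Return the index j that maximizes o[j] among classes with a[j] == 1."""
    candidates = [j for j, flag in enumerate(a) if flag == 1]
    if not candidates:
        return None  # assumption: no applicable tree means no prediction
    return max(candidates, key=lambda j: o[j])


# Class 0 has the highest probability overall but is not applicable,
# so the prediction falls on the best applicable class.
o = [0.5, 0.3, 0.2]
a = [0, 1, 1]
print(predict_edit_tree(o, a))  # 1
```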

Results
As stated in the Introduction, our method outperforms the baseline system on 15 of the 52 languages in the high resource configuration on the test sets. In the medium and low resource situations, it beats the baseline on 5 and 4 languages respectively. We provide these results in Table 1. The results show that the proposed method is resource intensive.
We also report our model's performance on the development datasets in Table 2. The results are quite similar to those in Table 1. When the training size is high, the proposed model beats the baseline on 15 languages. In the medium and low resource scenarios, it beats the baseline on only 4 languages.