IIT(BHU)–IIITH at CoNLL–SIGMORPHON 2018 Shared Task on Universal Morphological Reinflection

This paper describes the systems submitted by IIT (BHU), Varanasi/IIIT Hyderabad (IITBHU–IIITH) for Task 1 of the CoNLL–SIGMORPHON 2018 Shared Task on Universal Morphological Reinflection (Cotterell et al., 2018). The task is to generate the inflected form given a lemma and a set of morphological features. The systems are evaluated on over 100 distinct languages and three different resource settings (low, medium and high). We formulate the task as a sequence to sequence learning problem. As most of the characters in the inflected form are copied from the lemma, we use a Pointer-Generator Network (See et al., 2017), which makes it easier for the system to copy characters from the lemma. The Pointer-Generator Network also helps in dealing with out-of-vocabulary characters during inference. Our best performing system stood 4th among 28 systems, 3rd among 23 systems and 4th among 23 systems for the low, medium and high resource settings respectively.


Introduction
Morphological Inflection is the process of inflecting a lemma according to a set of morphological features so that it agrees with the other words in the sentence. It is useful for alleviating data sparsity during Natural Language Generation, especially in morphologically rich languages. For example, Minkov et al. (2007) translate words from the source language to lemmas in the target language and then use Morphological Inflection as a post-processing step to make the words of the output sentence agree with each other. Their approach not only reduces data sparsity by decreasing the number of candidate words during translation, but also gives better results.
* This research was conducted during the authors' internship at IIIT Hyderabad.
CoNLL-SIGMORPHON 2018 Shared Task on Universal Morphological Reinflection consisted of two tasks. Participants could compete in either or both of the tasks. We participated in Task 1 only. The task was to build a system which could inflect a lemma given a set of morphological tags. The systems were evaluated on over 100 distinct languages, out of which 10 were surprise languages. An example showing input and the expected output of the system is given below.
(touch, V;V.PTCP;PRS) → touching
To assess the system's ability to generalize in different resource settings, three varying amounts of labeled training data (low, medium, high) were given. The systems were evaluated separately for each language and each of the three data quantity conditions. Accuracy (the fraction of correctly predicted forms) and the average Levenshtein distance between the prediction and the truth across all predictions were used as metrics. An aggregated performance measure for each resource setting was obtained by averaging the results for individual languages.
Morphological Inflection is accomplished by different morphological processes, such as prefixation, infixation and suffixation (attaching a bound morpheme in front of, within and at the end of the stem respectively) and ablaut, depending on the language. As the systems were evaluated on over 100 distinct languages, we were motivated to use neural network based approaches because they do not require any manual feature engineering. However, neural networks usually require a lot of training data to work well. We try to address this challenge by designing neural network architectures which work well even in the low resource setting of the task.
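The two evaluation metrics described above can be sketched in a few lines of Python. This is only an illustration, not the official evaluation script: the function names `levenshtein` and `evaluate` are hypothetical, and unit costs for insertion, deletion and substitution are assumed.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions (assumed costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def evaluate(predictions, targets):
    """Return (accuracy, average Levenshtein distance) over paired forms."""
    n = len(targets)
    accuracy = sum(p == t for p, t in zip(predictions, targets)) / n
    avg_distance = sum(levenshtein(p, t) for p, t in zip(predictions, targets)) / n
    return accuracy, avg_distance
```

For a single language, `evaluate` would be called on the predicted and gold inflected forms; the per-setting aggregate is then the mean of these values over languages.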
Our system is based on attention based encoder-decoder models. The lemma and the tags are encoded using two separate encoders. While decoding, the decoder reads relevant parts of the lemma and the tags using an attention mechanism. As most of the characters in the inflected form are copied from the lemma, it is necessary to design a system with a strong tendency to copy. We use a Pointer-Generator Network (See et al., 2017), which facilitates copying characters of the lemma and tackles the problem of out-of-vocabulary tokens during prediction. Compared to other similarly performing systems, our system is trained end-to-end, doesn't require data augmentation techniques and uses soft attention rather than hard monotonic attention, which makes it more flexible. Our best performing system outperforms the baseline by 14.21%, 22.41% and 19.13% for the low, medium and high resource settings respectively. It stood 4th among 28 systems, 3rd among 23 systems and 4th among 23 systems for the low, medium and high data conditions respectively.
Figure 1: Neural network architecture for our system. The two encoders are shown at the top, while the decoder is shown at the bottom. At each time step, the decoder computes an attention distribution over both the lemma and the tags separately. The attention mechanism is shown by the dotted lines (a darker colour corresponds to more weight). A scalar generation probability p_gen ∈ [0, 1] (shown as the square; the lighter the colour, the lower the value) is also calculated at each time step, which corresponds to how likely a character is to be generated from the vocabulary instead of being copied from the lemma.
The remainder of this paper is organized as follows. We present prior work on Morphological Inflection in Section 2. We describe our system in Section 3. The results of the shared task are presented in Section 4. In Section 5, we present ablation studies and discuss the contribution of the specific design decisions we made to the performance of our systems. We conclude the paper with Section 6.

Background
Traditional approaches for morphological inflection involve crafting hand-engineered rules. Although these rules offer high accuracy, they are very expensive to create.
Machine learning based approaches treat morphological inflection as a string transduction task (Durrett and DeNero, 2013; Hulden et al., 2014; Ahlberg et al., 2015; Nicolai et al., 2015). These approaches extract rules automatically from the data, but they still require language specific feature engineering.
Neural network based approaches successfully solve this problem. These approaches require no feature engineering and the same architecture works for different languages. Faruqui et al. (2016) were the first to formulate morphological inflection as a neural sequence to sequence learning problem. Kann and Schütze (2016) improved on their approach by using a single model instead of separate models for each morphological feature. They fed morphological tags into the encoder along with the sequence of characters of the lemma. They also used an attention mechanism. Aharoni and Goldberg (2017) present an alternative to soft attention in the form of hard monotonic attention, which models the almost monotonic alignment between characters in the lemma and the inflected form. The best performing system (Makarov et al., 2017) of the previous edition of this shared task extended the hard monotonic attention model of Aharoni and Goldberg (2017) with a copy mechanism (HACM model). To deal with low training data especially in the low and medium resource settings, some teams used data augmentation techniques Bergmanis et al.,

System Description
In this section, we describe our system in detail. We report the neural network architecture, the training process, the hyperparameters and our submissions.

Neural network architecture
Our neural network architecture is based on Pointer-Generator Network (See et al., 2017) with some subtle differences.
Characters of the lemma c_i, along with the additional start and stop characters, are fed one by one into a bidirectional LSTM encoder, producing a sequence of hidden states h^l_i. Similarly, using a separate bidirectional LSTM encoder, the tags tg_i are encoded and another sequence of hidden states h^tg_i is obtained.
We use a unidirectional LSTM as the decoder. The decoder's hidden state s_i is initialised by applying an affine transformation on the concatenation of the last hidden states of the lemma and the tag encoders. As the input and output sequences have different semantics, this affine transformation gives the model the ability to learn the transformation of semantics from input to output (Faruqui et al., 2016).
While decoding, at each time step t, the decoder computes an attention distribution over the lemma and the tags separately, denoted as a^t_l and a^t_tg.
The context vectors h*_l and h*_tg are computed as the weighted sums over the encoder hidden states h^l_i and h^tg_i, with the attention distributions a^t_l and a^t_tg as weights.
The combined context vector is obtained by simply concatenating the lemma and the tag context vector.
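The attention and context-vector computation described above can be illustrated with a minimal sketch in plain Python. The function names are hypothetical, and dot-product scoring is an assumption made for brevity; the text does not state which scoring function the system uses.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (attention distribution, context vector) for one encoder.

    Dot-product scoring is an assumption; any scoring function that yields
    one score per encoder state would fit the description in the text.
    """
    scores = [sum(s * h for s, h in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context


def combined_context(decoder_state, lemma_states, tag_states):
    """Attend over lemma and tags separately, then concatenate the two contexts."""
    _, ctx_lemma = attend(decoder_state, lemma_states)
    _, ctx_tags = attend(decoder_state, tag_states)
    return ctx_lemma + ctx_tags  # list concatenation
```

Keeping the two attention distributions separate means the lemma attention can later be reused directly as a copy distribution over lemma characters.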
The combined context vector h*_t and the embedding of the character predicted at the previous time step, y_{t-1}, are given as input to the decoder (during training, to speed up convergence, we use the ground truth label y*_{t-1} instead). At the first time step, the start character is given as input in place of y_{t-1}.
The decoder computes its new hidden state from these inputs through a nonlinear function f. A probability distribution over the characters in the vocabulary is then calculated, which corresponds to how likely a particular character is to be generated (if a character is generated at all).
At each time step, a generation probability p_gen ∈ [0, 1] is calculated. The generation probability determines if the decoder will generate a character from the vocabulary or copy a character from the lemma.
Note that here p_gen is calculated using y_{t-1} (the embedding of the output produced at the previous time step) instead of the decoder input x_t as in See et al. (2017).
The probability of predicting a character c is computed as the sum of the probability of generating c, weighted by the generation probability p_gen, and the total attention mass over the occurrences of c in the lemma, weighted by the probability of copying, (1 − p_gen).
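The mixture of generation and copying described above can be sketched as follows. The function name and the dictionary representation are hypothetical choices for illustration; the numbers in the usage note are made up.

```python
def final_distribution(p_gen, vocab_probs, lemma, attention):
    """Mix the generation distribution with the copy distribution.

    vocab_probs: dict mapping each vocabulary character to its generation probability.
    lemma:       list of lemma characters (may contain out-of-vocabulary characters).
    attention:   attention weights over the lemma positions, summing to 1.
    """
    # Generation part, weighted by p_gen.
    probs = {c: p_gen * p for c, p in vocab_probs.items()}
    # Copy part: every occurrence of a character adds its attention weight.
    for ch, a in zip(lemma, attention):
        probs[ch] = probs.get(ch, 0.0) + (1.0 - p_gen) * a
    return probs
```

For example, with `vocab_probs = {"a": 0.5, "b": 0.5}`, lemma `"aza"`, attention `[0.2, 0.5, 0.3]` and `p_gen = 0.4`, the character `z` is not in the vocabulary but still receives probability 0.3 through copying, which is how the network handles out-of-vocabulary characters.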
The decoder keeps predicting characters until the stop character is predicted or a fixed number of time steps is reached.
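The stopping rule above can be sketched as a simple decoding loop. Greedy selection of the most probable character is an assumption here (the text does not say whether greedy or beam search is used), and `step`, the start and stop symbols and `max_steps` are hypothetical placeholders.

```python
def greedy_decode(step, start="<s>", stop="</s>", max_steps=50):
    """Decode until the stop character is predicted or max_steps is reached.

    step(prev_char, state) is a hypothetical callable returning
    (dict of character probabilities, new decoder state).
    """
    output, prev, state = [], start, None
    for _ in range(max_steps):
        probs, state = step(prev, state)
        prev = max(probs, key=probs.get)  # greedy choice (assumption)
        if prev == stop:
            break
        output.append(prev)
    return "".join(output)
```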

Training
We use negative log likelihood to compute the loss. The loss for time step t, where c*_t is the target character, is the negative log of the probability P(c*_t) assigned to the target character. The loss for the overall sequence is obtained by combining the per-step losses over all time steps.

Table 1
                           low    medium   high
embedding size             100    100      300
hidden units               100    100      100
dropout probability (p)    0.5    0.5      0.3
initial epochs (e_1)       300    80       60
extended epochs (e_2)      100    20       10

We use the Adam optimiser (Kingma and Ba, 2014) with an initial learning rate of 0.001 and a batch size of 32 to train the neural network. To deal with the exploding gradient problem, we clip the norm computed over all the gradients together to 3. We apply dropout (Srivastava et al., 2014) with probability p over the embeddings and the encoder hidden states.
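As a rough illustration of the loss and the gradient clipping described above, the following sketch uses plain Python lists rather than tensors. Averaging the per-step losses over the sequence is an assumption (it follows See et al. (2017); the exact aggregation is not recoverable from the text), and all function names are hypothetical.

```python
import math


def step_loss(probs, target):
    """Negative log likelihood of the target character at one time step."""
    return -math.log(probs[target])


def sequence_loss(step_probs, targets):
    """Per-step losses averaged over the sequence (averaging is an assumption)."""
    losses = [step_loss(p, t) for p, t in zip(step_probs, targets)]
    return sum(losses) / len(losses)


def clip_global_norm(grads, max_norm=3.0):
    """Rescale all gradients together so that their joint L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm:
        return grads
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads]
```

Clipping the joint norm, rather than each parameter's gradient separately, preserves the direction of the overall update while bounding its size.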
We use early stopping to prevent overfitting. A portion of the development set is used as the validation set. After each epoch, performance on the validation set is calculated. Initially the model is trained for e_1 epochs. If the highest performance on the validation set was obtained within the most recent e_2 epochs, the model is trained for a further e_2 epochs. This goes on until performance on the validation set stops improving.
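One way to read the training schedule above is sketched below. The callables `train_epoch` and `validate` are hypothetical placeholders, and the interpretation that the check is made each time the current epoch budget is exhausted is an assumption.

```python
def train_with_extension(train_epoch, validate, e1, e2):
    """Train for e1 epochs, then keep extending by e2 epochs while the best
    validation score was reached within the most recent e2 epochs.

    Returns (best_score, best_epoch, epochs_run).
    """
    best_score, best_epoch, epoch = float("-inf"), 0, 0
    budget = e1
    while epoch < budget:
        epoch += 1
        train_epoch()
        score = validate()
        if score > best_score:
            best_score, best_epoch = score, epoch
        # At the end of the current budget, extend if the best epoch is recent.
        if epoch == budget and best_epoch > budget - e2:
            budget += e2
    return best_score, best_epoch, epoch
```

In practice the model parameters from `best_epoch` would be kept for prediction; saving and restoring them is omitted here.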
Single layer LSTMs were used as encoders and decoders to reduce number of parameters. Optimal size of embeddings and the number of hidden units in LSTMs were determined based on the performance of the model on a subset of languages in development set.
The values for the hyperparameters p, e_1, e_2, embedding size and hidden units of the LSTMs are given in Table 1.
We used PyTorch for implementing the network. The code for the system is available at https://github.com/abhishek0318/conll-sigmorphon-2018.

Submissions
We made a total of two submissions. For the first submission, we trained only one system for each language and data resource setting pair. We used an ensembling technique for the second submission. We trained 5, 3 and 1 system(s) for each language in the low, medium and high data resource settings respectively. Their predictions were combined using hard voting.
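Hard voting over the predictions of several systems can be sketched as below. The function name is hypothetical, and the tie-breaking rule (the earliest system's prediction wins) is an assumption, since the text does not specify one.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the most frequent prediction; ties go to the earliest system (assumption)."""
    counts = Counter(predictions)
    best = max(counts.values())
    for p in predictions:
        if counts[p] == best:
            return p
```

For instance, `hard_vote(["touching", "touchin", "touching"])` returns `"touching"`.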

Results
The average accuracy of the system over all the languages in each data resource setting is presented in Table 2. Our best performing system outperforms the baseline by large margins: 14.21%, 22.41% and 19.13% for the low, medium and high resource settings respectively.
We observe that using an ensembling technique (in the second submission) gives a boost of a few percentage points in accuracy over the first submission, where ensembling is not used.

Ablation Studies
In this section, we investigate how different system design choices influenced the performance of the system. As reasonable performance was obtained for the medium and high resource settings in previous editions of the shared task, we focus our attention on the low resource setting and compare models on this setting.

Pointer Generator Network
We examine the performance gain obtained by using Pointer-Generator Network, the essence of our system. We compare the performance of a simple attention based neural encoder-decoder model with and without using ideas from Pointer-Generator Network.
Consider the architecture proposed by Kann and Schütze (2016) for the task of morphological inflection. The architecture is based on a simple attention based encoder-decoder model. The source sequence s_i consists of the characters of the lemma followed by the tags.
We include ideas from the Pointer-Generator Network in this model. At each time step, the decoder calculates the generation probability p_gen (See et al., 2017). The network uses the computed attention distribution to determine which character from the lemma it should copy. Because there is only a single encoder, the attention distribution is over both the lemma and the tags, so the tags receive some of the attention mass. To use Equation 12, we must normalise the attention weights of the characters, so that we have a new attention distribution over the set of characters.
We use a modified form of Equation 12, with the attention weights renormalised in this way, to calculate P(c). Here C is the set of characters.
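The renormalisation described above can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the function name, the `is_char` mask marking which source positions hold lemma characters, and the example values are all hypothetical.

```python
def final_distribution_single_encoder(p_gen, vocab_probs, source, attention, is_char):
    """Pointer-generator mixture when lemma characters and tags share one encoder.

    source:    list of source symbols (lemma characters followed by tags).
    attention: attention weights over all source positions, summing to 1.
    is_char:   True where the source position is a lemma character, False for tags.
    """
    # Attention mass that falls on lemma characters only.
    char_mass = sum(a for a, keep in zip(attention, is_char) if keep)
    probs = {c: p_gen * p for c, p in vocab_probs.items()}
    for sym, a, keep in zip(source, attention, is_char):
        if keep:
            # Renormalise so the copy distribution over characters sums to 1.
            probs[sym] = probs.get(sym, 0.0) + (1.0 - p_gen) * a / char_mass
    return probs
```

Without the division by `char_mass`, the attention assigned to tags would be lost and the final distribution would no longer sum to one.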
For the same hyperparameters, the architecture used in Kann and Schütze (2016) gives 21.99% average accuracy, whereas the architecture including ideas from the Pointer-Generator Network gives 44.02% average accuracy, tested on the development set over all the languages for the low resource setting. Thus using the Pointer-Generator Network roughly doubles the accuracy of the system in the low resource setting.

Separate Encoder for Tags
We investigate the benefit of using a separate encoder for the tags, instead of encoding them using the same encoder as in Kann and Schütze (2016).
Consider the neural network architecture with two separate encoders for the lemma and the tags. At each time step while decoding, the attention distribution is computed over the lemma. The last hidden state of the tag encoder is used as the representation of the set of tags. It is fed to the decoder, along with the context vector of the lemma, at each time step. We compare the performance of this architecture to the architecture described in Section 5.1 (which uses a single encoder for the lemma and tags and the Pointer-Generator Network). The architecture with a single encoder obtains 44.02% average accuracy, while the one with two separate encoders achieves 48.18% average accuracy, tested on the development set for the low resource setting.
A possible explanation for the difference in performance is that the lemma and the tags are completely separate entities and a single encoder can't encode them correctly. We were motivated to represent the tags using embeddings, as embeddings have more representational power compared to the zeros and ones of a one hot encoding. As the number of tags varies for each example, using an LSTM to encode them seemed apt. Note that the representation obtained using this approach is not order invariant. Using order invariant representations (Vinyals et al., 2016; Zaheer et al., 2017) is left as future work.

Attention over Tags
We inspect whether using attention over the sequence of tags, as compared to using a fixed vector representation, gives better results. We consider the architecture introduced in Section 5.2. Instead of using the last hidden state of the encoder to represent the tags, we use attention over the tags too and compare the performance. Note that this is the same architecture we described in Section 3.1. Using attention over tags leads to an average accuracy of 49.08%, as compared to 48.18%, on the development set for the low resource setting. This can be explained by the fact that, with an attention mechanism, the model doesn't need to compress the information of all the tags into a single vector. It can attend to a specific tag based on the decoder state.

Hierarchical Attention
We investigate if using Hierarchical Attention (Libovický and Helcl, 2017) instead of just concatenating the two context vectors for lemma and the tags as done in Equation 8 proves advantageous. Libovický and Helcl (2017) proposed Hierarchical Attention technique for combining the context vectors in case of multiple source sequences.
After computing the individual context vectors, a scalar a_h ∈ [0, 1] is calculated. This scalar corresponds to how the attention should be divided between the lemma and the tags. The combined context vector is obtained by taking the average of the two context vectors weighted by a_h. Compared to concatenating the context vectors (as done in our submission), using hierarchical attention gives worse results (46.60% average accuracy compared to 49.08% on the development set for the low resource setting). This is possibly because of the increase in the number of parameters to learn and the additional nonlinearities, such as sigmoid and tanh, which can lead to the vanishing gradient problem.
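The difference between the two ways of combining context vectors can be made concrete with a small sketch. How a_h is computed is not recoverable from the text, so it is taken as a given input here; which context vector a_h weights is also an assumption, and the function name is hypothetical.

```python
def weighted_average_context(a_h, ctx_lemma, ctx_tags):
    """Combine two context vectors of equal size as a_h * lemma + (1 - a_h) * tags.

    Assigning a_h to the lemma side is an assumption made for illustration.
    """
    return [a_h * l + (1.0 - a_h) * t for l, t in zip(ctx_lemma, ctx_tags)]
```

Unlike concatenation, which keeps both vectors intact and doubles the input size, the weighted average requires both context vectors to have the same dimension and mixes them into a single vector of that dimension.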

Conclusion
In this paper, we described the IITBHU–IIITH system for Task 1 of the CoNLL–SIGMORPHON 2018 Shared Task. Our system is one of the top performing systems in this edition of the shared task and beats the baseline by large margins. Even though our approach is based entirely on neural networks, our system works very well in the low resource setting.
We conclude that neural network architectures with an explicit copying mechanism (like the Pointer-Generator Network) perform well on the Morphological Inflection task, even in the low resource setting.