Morphological Inflection Generation with Hard Monotonic Attention

We present a neural model for morphological inflection generation which employs a hard attention mechanism, inspired by the nearly-monotonic alignment commonly found between the characters in a word and the characters in its inflection. We evaluate the model on three previously studied morphological inflection generation datasets and show that it provides state of the art results in various setups compared to previous neural and non-neural approaches. Finally we present an analysis of the continuous representations learned by both the hard and soft (Bahdanau, 2014) attention models for the task, shedding some light on the features such models extract.

The task is important for many down-stream NLP tasks such as machine translation, especially for dealing with data sparsity in morphologically rich languages where a lemma can be inflected into many different word forms. Several studies have shown that translating into lemmas in the target language and then applying inflection generation as a post-processing step is beneficial for phrase-based machine translation (Minkov et al., 2007; Toutanova et al., 2008; Clifton and Sarkar, 2011; Fraser et al., 2012; Chahuneau et al., 2013) and more recently for neural machine translation (García-Martínez et al., 2016).
The task was traditionally tackled with hand-engineered finite state transducers (FST) (Koskenniemi, 1983; Kaplan and Kay, 1994), which rely on expert knowledge, or with trainable weighted finite state transducers (Mohri et al., 1997; Eisner, 2002), which combine expert knowledge with data-driven parameter tuning. Many other machine-learning based methods (Yarowsky and Wicentowski, 2000; Dreyer and Eisner, 2011; Durrett and DeNero, 2013; Hulden et al., 2014; Ahlberg et al., 2015; Nicolai et al., 2015) were proposed for the task, although with specific assumptions about the set of possible processes that are needed to create the output sequence. More recently, the task was modeled as neural sequence-to-sequence learning over character sequences with impressive results (Faruqui et al., 2016). The vanilla encoder-decoder models used by Faruqui et al. compress the input sequence to a single, fixed-sized continuous representation. In contrast, the soft-attention based sequence-to-sequence learning paradigm allows directly conditioning on the entire input sequence representation, and was utilized for morphological inflection generation with great success (Kann and Schütze, 2016b,a).
However, the neural sequence-to-sequence models require large training sets in order to perform well: their performance on the relatively small CELEX dataset is inferior to the latent variable WFST model of Dreyer et al. (2008). Interestingly, the neural WFST model by Rastogi et al. (2016) also suffered from the same issue on the CELEX dataset, and surpassed the latent variable model only when given twice as much data to train on.
We propose a model which handles the above issues by directly modeling an almost monotonic alignment between the input and output character sequences, which is commonly found in the morphological inflection generation task (e.g. in languages with concatenative morphology). The model consists of an encoder-decoder neural network with a dedicated control mechanism: in each step, the model attends to a single input state and either writes a symbol to the output sequence or advances the attention pointer to the next state from the bi-directionally encoded sequence, as described visually in Figure 1.
This modeling suits the natural monotonic alignment between the input and output, as the network learns to attend to the relevant inputs before writing the output to which they are aligned. The encoder is a bi-directional RNN, where each character in the input word is represented by the concatenation of the forward and backward RNN states over the word's characters. The combination of the bi-directional encoder and the controllable hard attention mechanism makes it possible to condition the output on the entire input sequence. Moreover, since each character representation is aware of the neighboring characters, non-monotone relations are also captured, which is important in cases where segments in the output word result from long-range dependencies in the input word. The recurrent nature of the decoder, together with a dedicated feedback connection that explicitly passes the last prediction to the next decoder step, enables the model to also condition the current output on all the previous outputs at each prediction step.
The hard attention mechanism allows the network to jointly align and transduce while using a focused representation at each step, rather than the weighted sum of representations used in the soft attention model. This makes our model Resolution Preserving (Kalchbrenner et al., 2016) while also keeping decoding time linear in the output sequence length rather than multiplicative in the input and output lengths as in the soft-attention model. In contrast to previous sequence-to-sequence work, we do not require the training procedure to also learn the alignment. Instead, we use a simple training procedure which relies on independently learned character-level alignments, from which we derive gold transduction+control sequences. The network can then be trained using a straightforward cross-entropy loss.
To evaluate our model, we perform extensive experiments on three previously studied morphological inflection generation datasets: the CELEX dataset (Baayen et al., 1993), the Wiktionary dataset (Durrett and DeNero, 2013) and the SIGMORPHON2016 dataset. We show that while our model is on par with or better than the previous neural and non-neural state-of-the-art approaches, it also performs significantly better with very small training sets, being the first neural model to surpass the performance of the weighted FST model with latent variables which was specifically tailored for the task by Dreyer et al. (2008). Finally, we analyze and compare our model and the soft attention model, showing that they function very similarly with respect to the alignments and representations they learn, in spite of our model being much simpler. This analysis also sheds light on the representations such models learn for the morphological inflection generation task, showing how they encode specific features like a symbol's type and the symbol's location in a sequence.
To summarize, our contributions in this paper are three-fold:
1. We present a hard attention model for nearly-monotonic sequence-to-sequence learning, as common in the morphological inflection setting.
2. We evaluate the model on the task of morphological inflection generation, establishing a new state of the art on three previously studied datasets for the task.
3. We perform an analysis and comparison of our model and the soft-attention model, shedding light on the features such models extract for the inflection generation task.

Motivation
We would like to transduce an input sequence, $x_{1:n} \in \Sigma_x^*$, into an output sequence, $y_{1:m} \in \Sigma_y^*$, where $\Sigma_x$ and $\Sigma_y$ are the input and output vocabularies, respectively. Imagine a machine with read-only random access to the encoding of the input sequence, and a single pointer that determines the current read location. We can then model sequence transduction as a series of pointer movement and write operations. If we assume the alignment is monotone, the machine can be simplified: the memory can be read in sequential order, where the pointer movement is controlled by a single "move forward" operation (step) which we add to the output vocabulary. We implement this behavior using an encoder-decoder neural network, with a control mechanism which determines in each step of the decoder whether to write an output symbol or advance the attention pointer to the next element of the encoded input.
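The pointer machine described above can be illustrated with a minimal sketch. The names `STEP` and `run_actions`, the clamping of the pointer at the last input position, and the example action sequence for flog → fliege are all assumptions made for illustration; they are not taken from the paper's implementation.

```python
STEP = "<step>"  # hypothetical name for the "move forward" control symbol


def run_actions(source, actions):
    """Replay a sequence of step/write actions over `source`.

    Returns the written output string and, for every written symbol,
    the source character the pointer was on when it was written.
    """
    pointer = 0
    output = []
    attended = []
    for action in actions:
        if action == STEP:
            # Monotone read: the pointer only ever moves forward by one.
            # Clamping at the last position is an assumption of this sketch.
            pointer = min(pointer + 1, len(source) - 1)
        else:
            output.append(action)
            attended.append(source[pointer])
    return "".join(output), attended


# Hypothetical action sequence for flog -> fliege.
actions = ["f", STEP, "l", STEP, "i", "e", STEP, "g", "e"]
word, attended = run_actions("flog", actions)
print(word)      # fliege
print(attended)  # ['f', 'l', 'o', 'o', 'g', 'g']
```

Note that the written symbols alone determine the output string; the pointer only determines which encoded input element the decoder is conditioned on at each step.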

Model Definition
At prediction time, we seek the output sequence $y_{1:m} \in \Sigma_y^*$ for which:

$$y_{1:m} = \arg\max_{y' \in \Sigma_y^*} p(y' \mid x_{1:n}, f)$$

where $x_{1:n} \in \Sigma_x^*$ is the input sequence and $f = \{f_1, \ldots, f_l\}$ is a set of attributes influencing the transduction task (in the inflection generation task these would be the desired morpho-syntactic attributes of the output sequence). Given a nearly-monotonic alignment between the input and the output, we replace the search for a sequence of letters with a search for a sequence of actions $s_{1:q} \in \Sigma_s^*$, where $\Sigma_s = \Sigma_y \cup \{step\}$. This sequence is the series of step and write actions required to go from $x_{1:n}$ to $y_{1:m}$ according to the monotonic alignment between them (we will elaborate on the deterministic process of getting $s_{1:q}$ from a monotonic alignment between $x_{1:n}$ and $y_{1:m}$ in section 2.4). In this case we define:

$$s_{1:q} = \arg\max_{s' \in \Sigma_s^*} p(s' \mid x_{1:n}, f) = \arg\max_{s' \in \Sigma_s^*} \prod_{s'_i \in s'} p(s'_i \mid s'_1 \ldots s'_{i-1}, x_{1:n}, f)$$

which we can estimate using a neural network:

$$s_{1:q} = \arg\max_{s' \in \Sigma_s^*} \mathrm{NN}(x_{1:n}, f, \Theta)$$

where the network's parameters $\Theta$ are learned using a set of training examples. We will now describe the network architecture.

[Figure 1: the network architecture. A round tip expresses concatenation of the inputs it receives. The attention is promoted to the next input element once a step action is predicted.]

Network Architecture
Notation. We use bold letters for vectors and matrices. We treat the LSTM as a parameterized function $\mathrm{LSTM}_\theta(\mathbf{x}_1 \ldots \mathbf{x}_n)$ mapping a sequence of input vectors $\mathbf{x}_1 \ldots \mathbf{x}_n$ to an output vector $\mathbf{h}_n$. The equations for the LSTM variant we use are detailed in the supplementary material of this paper.
Encoder. For every element in the input sequence $x_{1:n} = x_1 \ldots x_n$, we take the corresponding embedding $\mathbf{e}_{x_1} \ldots \mathbf{e}_{x_n}$, where $\mathbf{e}_{x_i} \in \mathbb{R}^{E}$. These embeddings are parameters of the model which are learned during training. We then feed the embeddings into a bi-directional LSTM encoder (Graves and Schmidhuber, 2005), which results in a sequence of vectors $\mathbf{x}_{1:n} = \mathbf{x}_1 \ldots \mathbf{x}_n$, where each vector $\mathbf{x}_i = [\mathrm{LSTM}_{forward}(\mathbf{e}_{x_1}, \ldots, \mathbf{e}_{x_i}), \mathrm{LSTM}_{backward}(\mathbf{e}_{x_n}, \ldots, \mathbf{e}_{x_i})] \in \mathbb{R}^{2H}$ is the concatenation of the forward LSTM and the backward LSTM outputs when fed with $\mathbf{e}_{x_i}$.

Decoder. Once the input sequence is encoded, we feed the decoder RNN, $\mathrm{LSTM}_{dec}$, with three inputs at each step:
1. The current attended input, $\mathbf{x}_a \in \mathbb{R}^{2H}$, initialized with the first element of the encoded sequence, $\mathbf{x}_1$.
2. A set of embeddings for the attributes that influence the generation process, concatenated to a single vector: $\mathbf{f} = [\mathbf{f}_1 \ldots \mathbf{f}_l] \in \mathbb{R}^{F \cdot l}$.
3. $\mathbf{s}_{i-1} \in \mathbb{R}^{E}$, an embedding for the predicted output symbol in the previous decoder step.

Those three inputs are concatenated into a single vector $\mathbf{z}_i = [\mathbf{x}_a, \mathbf{f}, \mathbf{s}_{i-1}] \in \mathbb{R}^{2H + F \cdot l + E}$, which is fed into the decoder, providing the decoder output vector $\mathrm{LSTM}_{dec}(\mathbf{z}_1 \ldots \mathbf{z}_i) \in \mathbb{R}^{H}$. Finally, to model the distribution over the possible actions, we project the decoder output to a vector of $|\Sigma_s|$ elements, followed by a softmax layer:

$$p(s_i = c) = \mathrm{softmax}_c(\mathbf{W} \cdot \mathrm{LSTM}_{dec}(\mathbf{z}_1 \ldots \mathbf{z}_i) + \mathbf{b})$$

Control Mechanism. When the most probable action is step, the attention is promoted so that $\mathbf{x}_a$ contains the next encoded input representation to be used in the next step of the decoder. The process is demonstrated visually in Figure 1.
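The control mechanism can be sketched as a greedy decoding loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `next_action` is a hypothetical callable standing in for the decoder LSTM plus softmax (in the real model it would be stateful and would also receive the attribute embeddings), `END` is a hypothetical end-of-word symbol since the stopping criterion is not described in this section, and `toy_policy` is a hand-written stand-in that simply copies the input.

```python
STEP = "<step>"  # hypothetical control symbol
END = "</w>"     # hypothetical end-of-word symbol (assumption of this sketch)


def greedy_decode(encoded, next_action, max_actions=100):
    """Greedy decoding with a single hard attention pointer.

    encoded:     list of per-character encoder states (any objects).
    next_action: hypothetical callable (attended_state, prev_action) -> action,
                 standing in for the decoder network and its softmax layer.
    """
    pointer = 0   # attention starts at the first encoded element
    prev = None   # previous predicted action, fed back to the decoder
    output = []
    for _ in range(max_actions):
        action = next_action(encoded[pointer], prev)
        if action == END:
            break
        if action == STEP:
            # Promote the attention to the next encoded input element.
            pointer = min(pointer + 1, len(encoded) - 1)
        else:
            # Any other action writes a symbol; the pointer stays in place.
            output.append(action)
        prev = action
    return output


def toy_policy(state, prev):
    """Toy stand-in for the network: copy each character, then step."""
    char, is_last = state
    if prev is None or prev == STEP:
        return char
    return END if is_last else STEP


word = "abc"
encoded = [(ch, i == len(word) - 1) for i, ch in enumerate(word)]
print("".join(greedy_decode(encoded, toy_policy)))  # abc
```

Since each action either writes one symbol or moves the pointer one position forward, the number of decoder steps is bounded by the sum of the input and output lengths.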

Training the Model
For every example $(x_{1:n}, y_{1:m}, f)$ in the training data, we should produce a sequence of step and write actions $s_{1:q}$ to be predicted by the decoder. The sequence depends on the alignment between the input and the output: ideally, the network will attend to all the input characters aligned to an output character before writing it. While recent work in sequence transduction advocates jointly training the alignment and the decoding mechanisms (Yu et al., 2016), we instead show that in our case it is worthwhile to decouple these stages and learn a hard alignment beforehand, using it to guide the training of the encoder-decoder network and enabling the use of correct alignments for the attention mechanism from the beginning of the network training phase. Thus, our training procedure consists of three stages: learning hard alignments, deriving oracle actions from the alignments, and learning a neural transduction model given the oracle actions.
Learning Hard Alignments. We use the character alignment model of Sudoh et al. (2013), based on a Chinese Restaurant Process which weights single alignments (character-to-character) in proportion to how many times such an alignment has been seen elsewhere out of all possible alignments. The aligner implementation we used produces either 0-to-1, 1-to-0 or 1-to-1 alignments.
Deriving Oracle Actions. We infer the sequence of actions $s_{1:q}$ from the alignments by the deterministic procedure described in Algorithm 1. An example of an alignment with the resulting oracle action sequence is shown in Figure 2, where $a_4$ is a 0-to-1 alignment and the rest are 1-to-1 alignments.

[Figure 2: Top: an alignment between a lemma $x_{1:n}$ and an inflection $y_{1:m}$ as predicted by the aligner. Bottom: $s_{1:q}$, the sequence of actions to be predicted by the network, as produced by Algorithm 1 for the given alignment.]
Algorithm 1: Generates the oracle action sequence $s_{1:q}$ from the alignment between $x_{1:n}$ and $y_{1:m}$
Require: a, the list of either 1-to-1, 1-to-0 or 0-to-1 alignments between $x_{1:n}$ and $y_{1:m}$
1: Initialize s as an empty sequence
2: for each $a_i = (x_{a_i}, y_{a_i}) \in a$ do
3:     if $a_i$ is a 1-to-0 alignment then
4:         s.append(step)
5:     else
6:         s.append($y_{a_i}$)
7:         if $a_{i+1}$ is not a 0-to-1 alignment then
8:             s.append(step)
return s

This procedure makes sure that all the source input elements aligned to an output element are read (using the step action) before writing it.

Learning a Neural Transduction Model. The network is trained to mimic the actions of the oracle, and at inference time it predicts the actions according to the input. We train it using a conventional cross-entropy loss function per example:

$$\mathcal{L}(x_{1:n}, y_{1:m}, f, \Theta) = -\sum_{j=1}^{q} \log p(s_j \mid s_1 \ldots s_{j-1}, x_{1:n}, f; \Theta)$$

Transition System. An alternative view of our model is that of a transition system with ADVANCE and WRITE(CH) actions, where the oracle is derived from a given hard alignment, the input is encoded using a biRNN, and the next action is determined by an RNN over the previous inputs and actions.
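The oracle derivation of Algorithm 1 can be sketched in a few lines. This is one reading of the procedure, under these assumptions: an alignment is a list of `(source_char, target_char)` pairs with `None` marking the empty side, a 1-to-0 alignment produces only a step, any other alignment writes its target character and then steps unless the next alignment is 0-to-1, and the last alignment is treated as having no 0-to-1 successor. The names `oracle_actions` and `STEP`, and the two example alignments, are hypothetical.

```python
STEP = "<step>"  # hypothetical control symbol


def oracle_actions(alignment):
    """Derive step/write actions from 1-to-1, 1-to-0 and 0-to-1 alignments.

    alignment: list of (source_char, target_char) pairs, where
               (x, None) is 1-to-0 and (None, y) is 0-to-1.
    """
    actions = []
    for i, (src, tgt) in enumerate(alignment):
        if tgt is None:
            # 1-to-0: the source character produces no output, so only advance.
            actions.append(STEP)
            continue
        actions.append(tgt)
        has_next = i + 1 < len(alignment)
        next_is_insertion = has_next and alignment[i + 1][0] is None
        if not next_is_insertion:
            # Advance only when the next output symbol is not an insertion
            # that should still be written while attending to this position.
            actions.append(STEP)
    return actions


# Hypothetical alignments for flog -> fliege and legte -> lege.
fliege = [("f", "f"), ("l", "l"), ("o", "i"), (None, "e"), ("g", "g"), (None, "e")]
lege = [("l", "l"), ("e", "e"), ("g", "g"), ("t", None), ("e", "e")]

print(oracle_actions(fliege))
# ['f', '<step>', 'l', '<step>', 'i', 'e', '<step>', 'g', 'e', '<step>']
print(oracle_actions(lege))
# ['l', '<step>', 'e', '<step>', 'g', '<step>', '<step>', 'e', '<step>']
```

In the second example the deleted character yields two consecutive step actions, and in the first the inserted characters are written without moving the pointer.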

Experiments
We perform extensive experiments with three previously studied morphological inflection generation datasets to evaluate our hard attention model in various settings. In all experiments we compare our hard attention model to the best performing neural and non-neural models which were previously published on those datasets, and to our implementation of the global (soft) attention model presented by Luong et al. (2015), which we train with the same hyper-parameters as our hard-attention model. The implementation details for our models are described in the supplementary material section of this paper. The source code and data for our models are available on GitHub.

CELEX. Our first evaluation is on a very small dataset, to see if our model indeed avoids the tendency to overfit with small training sets. We report exact match accuracy on the German inflection generation dataset compiled by Dreyer et al. (2008).

Wiktionary. To neutralize the negative effect of very small training sets on the performance of the different learning approaches, we also evaluate our model on the dataset created by Durrett and DeNero (2013), which contains up to 360k training examples per language. It was built by extracting Finnish, German and Spanish inflection tables from Wiktionary, and was used to evaluate their system, which is based on string alignments and a semi-CRF sequence classifier with linguistically inspired features and which we use as a baseline. We also used the dataset expansion made by Nicolai et al. (2015) to include French and Dutch inflections as well. Their system also follows an align-and-transduce approach, extracting rules from the aligned training set and applying them at inference time with a proprietary character sequence classifier. In addition to those systems we also compare to the results of the recent neural approach of Faruqui et al. (2016), which did not use an attention mechanism, and of Yu et al. (2016), which coupled the alignment and transduction tasks.
SIGMORPHON. As different languages show different morphological phenomena, we also examine how our model copes with these phenomena using the morphological inflection dataset from the SIGMORPHON2016 shared task. Here the training data consists of ten languages, with five morphological system types (detailed in Table 3): Russian (RU), German (DE), Spanish (ES), Georgian (GE), Finnish (FI), Turkish (TU), Arabic (AR), Navajo (NA), Hungarian (HU) and Maltese (MA), with roughly 12,800 training and 1,600 development examples per language. We compare our model to two soft attention baselines on this dataset: MED (Kann and Schütze, 2016b), which was the best participating system in the shared task, and our implementation of the global (soft) attention model presented by Luong et al. (2015).
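Exact match accuracy, the metric reported for CELEX above, counts a prediction as correct only if the whole generated word equals the reference. A minimal sketch follows; the function name and the example word lists are hypothetical.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference string exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(pred == ref for pred, ref in zip(predictions, references))
    return correct / len(references)


# Hypothetical predictions: the last one differs from its reference by one character.
predictions = ["fliege", "lege", "gehe"]
references = ["fliege", "lege", "geht"]
accuracy = exact_match_accuracy(predictions, references)
print(round(accuracy, 3))  # 0.667
```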

Results
In all experiments, for both the hard and soft attention models we implemented, we report results using an ensemble of 5 models with different random initializations, combined by majority voting on the final sequences the models predicted, as proposed by Kann and Schütze (2016a). This was done to allow a fair comparison to the models of Kann and Schütze (2016a,b) and Faruqui et al. (2016), which also perform a similar ensembling technique.

In the low-resource setting (CELEX), our hard attention model significantly outperforms both the recent neural models of Kann and Schütze (2016a) (MED) and Rastogi et al. (2016) (NWFST) and the morphologically aware latent variable model of Dreyer et al. (2008) (LAT), as detailed in Table 1. In addition, it significantly outperforms our implementation of the soft attention model (Soft). It is also, to our knowledge, the first model to surpass the overall accuracy of the latent variable model on this dataset. We attribute our advantage over the soft attention models to the ability of the hard attention control mechanism to harness the monotonic alignments found in the data. The advantage over the FST models may be explained by our conditioning on the entire output history, which is not available in those models.

Figure 3 plots the train-set and dev-set accuracies of the soft and hard attention models as a function of the training epoch. While both models perform similarly on the train-set (with the soft attention model fitting it slightly faster), the hard attention model performs significantly better on the dev-set. This shows the soft attention model's tendency to overfit on the small dataset, as it does not enforce the monotonic assumption of the hard attention model.
In the large training set experiments (Wiktionary), our model is the best performing model on German verbs, Finnish nouns/adjectives and Dutch verbs, resulting in the highest reported average accuracy across all inflection types when compared to the four previous neural and non-neural state-of-the-art baselines, as detailed in Table 2. This shows the robustness of our model also with large amounts of training examples, and the advantage the hard attention mechanism provides over the encoder-decoder approach of Faruqui et al. (2016), which does not employ an attention mechanism. Our model is also significantly more accurate than the model of Yu et al. (2016), which shows the advantage of using independently learned alignments to guide the network's attention from the beginning of the training process. While our soft-attention implementation outperformed the models of Yu et al. (2016) and Durrett and DeNero (2013), it was still less accurate than the hard attention model.

On the SIGMORPHON 2016 dataset (Table 3), our model performs better than both soft-attention baselines for the suffixing+stem-change languages (Russian, German and Spanish) and is slightly less accurate than our implementation of the soft attention model on the rest of the languages; that soft attention implementation is now, to our knowledge, the best performing model on this dataset. We explain this by looking at the languages from a linguistic typology point of view, as detailed in Table 3. Since Russian, German and Spanish employ a suffixing morphology with internal stem changes, they are more suitable for monotonic alignment, as the transformations they need to model are the addition of suffixes and the changing of characters in the stem. The rest of the languages in the dataset employ more context-sensitive morphological phenomena like vowel harmony and consonant harmony, which require modeling long-range dependencies in the input sequence and therefore better suit the soft attention mechanism.
While our implementation of the soft attention model and MED are very similar in terms of modeling, we hypothesize that our soft attention model results are better because we trained the model for 100 epochs and picked the best performing model on the development set, while the MED system was trained for a fixed number of 20 epochs (although on more data: both the train and development sets).
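The ensembling used for all reported results (majority voting over the final sequences predicted by 5 models) can be sketched as follows. The function name and the example predictions are hypothetical, and the tie-breaking rule (the earliest-listed prediction among the most frequent ones wins) is an assumption of this sketch, not something the text specifies.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent final sequence among the ensemble's predictions.

    Ties go to the prediction that appears first in `predictions`, because
    Counter.most_common orders equal counts by first occurrence.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical outputs of 5 models for the same input.
ensemble_outputs = ["fliege", "fliege", "flige", "fliege", "floge"]
print(majority_vote(ensemble_outputs))  # fliege
```

Voting on whole output sequences, rather than on per-step distributions, means the individual models can be decoded independently.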

Analysis
The Learned Alignments. In order to see if the alignments predicted by our model fit the monotonic alignment structure found in the data, and whether they are more suitable for the task than the alignments found by the soft attention model, we examined alignment predictions of the two models on examples from the development portion of the CELEX dataset, as depicted in Figure 4. First, we notice that the alignments found by the soft attention model are also monotonic, supporting our modeling approach for the task. Figure 4 (bottom-right) also shows how the hard-attention model performs deletion (legte→lege) by predicting a sequence of two step operations. Another notable morphological transformation is the one-to-many alignment, found in the top example: flog→fliege, where the model needs to transform a character in the input, o, into two characters in the output, ie. This is performed by two consecutive write operations after the step operation that moves to the character to be replaced. Notice that in this case, the soft attention model produces a different alignment, aligning the character i to o and the character g to the sequence eg, which is not the expected alignment from a linguistic point of view.
The Learned Representations. How does the soft-attention model manage to learn nearly-perfect monotonic alignments? Perhaps the network learns to encode the sequential position as part of its encoding of an input element? More generally, what information is encoded by the soft and hard alignment encoders? We selected 500 random encoded characters-in-context from input words. Since those vectors are outputs from the bi-LSTM encoders of the models, every vector of this form carries information about the specific character together with its entire context. We project these encodings into 2-D using SVD and plot them twice, each time using a different coloring scheme. We first color each point according to the character it represents (Figures 5a, 5b). In the second coloring scheme (Figures 5c, 5d), each point is colored according to the character's sequential position in the word it came from, with blue indicating positions near the beginning of the word and red indicating positions near its end. While both models tend to cluster representations for similar characters together (Figures 5a, 5b), the hard attention model tends to have much more isolated character clusters. Figures 5c, 5d show that both models also tend to learn representations which are sensitive to the position of the character, although here the soft attention model seems more sensitive to this information, as its coloring forms a nearly-perfect red-to-blue transition on the X axis. This may be explained by the soft-attention mechanism encouraging the encoder to encode positional information in the input representations, which may help it to predict better attention scores and to avoid collisions when computing the weighted sum of representations for the context vector. In contrast, our hard-attention model has other means of obtaining the position information in the decoder, namely the step actions, and for that reason it does not encode it as strongly in the representations of the inputs.
This behavior may allow it to perform well even with fewer examples, as the location information is represented more explicitly in the model using the step actions.
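The 2-D projection used in this analysis can be sketched with NumPy. This is an illustration under assumptions: the encoder outputs are replaced by random vectors, the dimension of 64 is arbitrary, and mean-centering before the SVD is a choice of this sketch (the text only says the encodings are projected into 2-D using SVD). The name `project_2d` is hypothetical.

```python
import numpy as np


def project_2d(encodings, center=True):
    """Project row vectors onto their top two singular directions.

    encodings: array-like of shape (num_points, dim).
    center:    subtract the per-dimension mean first (an assumption here).
    """
    X = np.asarray(encodings, dtype=float)
    if center:
        X = X - X.mean(axis=0, keepdims=True)
    # Thin SVD: X = U @ diag(S) @ Vt, singular values sorted in descending order.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    # Coordinates along the first two right singular vectors (same as X @ Vt[:2].T).
    return U[:, :2] * S[:2]


# Random stand-ins for 500 encoded characters-in-context.
rng = np.random.default_rng(0)
fake_encodings = rng.normal(size=(500, 64))
points = project_2d(fake_encodings)
print(points.shape)  # (500, 2)
```

Each resulting point can then be colored either by the character it encodes or by that character's position in its word, as in the two coloring schemes described above.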

Related Work
Many previous works on inflection generation used machine learning methods (Yarowsky and Wicentowski, 2000; Dreyer and Eisner, 2011; Durrett and DeNero, 2013; Hulden et al., 2014; Ahlberg et al., 2015; Nicolai et al., 2015) with assumptions about the set of possible processes needed to create the output word. Our work was mainly inspired by Faruqui et al. (2016), who trained an independent encoder-decoder neural network for every inflection type in the training data, alleviating the need for feature engineering. Kann and Schütze (2016b,a) tackled the task with a single soft attention model for all inflection types, which resulted in the best submission at the SIGMORPHON 2016 shared task. In another closely related work, Rastogi et al. (2016) model the task with a WFST in which the arc weights are learned by optimizing a global loss function over all the possible paths in the state graph, while modeling contextual features with bi-directional LSTMs. This is similar to our approach, except that instead of learning to mimic a single greedy alignment as we do, they sum over all possible alignments. While not committing to a single greedy alignment could in theory be beneficial, we see in Table 1 that, at least for the low-resource scenario, our greedy approach is more effective in practice. Another recent work (Kann et al., 2016) proposed performing neural multi-source morphological reinflection, generating an inflection from several source forms of a word.
Previous works on neural sequence transduction include the RNN Transducer (Graves, 2012), which uses two independent RNNs over monotonically aligned sequences to predict a distribution over the possible output symbols in each step, including a null symbol to model the alignment. Yu et al. (2016) improved this by replacing the null symbol with a dedicated learned transition probability. Both models are trained using a forward-backward approach, marginalizing over all possible alignments. Our model differs from the above by learning the alignments independently, thus enabling a dependency between the encoder and decoder. While providing better results than Yu et al. (2016), this also simplifies the model training, which uses a simple cross-entropy loss. A recent work by Raffel et al. (2017) jointly learns the hard monotonic alignments and transduction while maintaining the dependency between the encoder and the decoder. Jaitly et al. (2015) proposed the Neural Transducer model, which is also trained on external alignments. They divide the input into blocks of a constant size and perform soft attention separately on each block. Lu et al. (2016) used a combination of an RNN encoder with a CRF layer to model the dependencies in the output sequence. An interesting comparison between "traditional" sequence transduction models (Bisani and Ney, 2008; Jiampojamarn et al., 2010; Novak et al., 2012) and encoder-decoder neural networks for monotone string transduction tasks was done by Schnober et al. (2016), showing that in many cases there is no clear advantage to one approach over the other.
Regarding task-specific improvements to the attention mechanism, a line of work on attention-based speech recognition (Chorowski et al., 2015; Bahdanau et al., 2016) proposed adding location awareness by using the previous attention weights when computing the next ones, and preventing the model from attending to too many or too few inputs using "sharpening" and "smoothing" techniques on the attention weight distributions. Cohn et al. (2016) offered several changes to the attention score computation to encourage well-known modeling biases found in traditional machine translation models, like word fertility, position and alignment symmetry. Regarding the utilization of independent alignment models for training attention-based networks, Mi et al. (2016) showed that the distance between the attention-infused alignments and the ones learned by an independent alignment model can be added to the network's training objective, resulting in improved translation and alignment quality.

Conclusion
We presented a hard attention model for morphological inflection generation. The model employs an explicit alignment which is used to train a neural network to perform transduction by decoding with a hard attention mechanism. Our model performs better than previous neural and non-neural approaches on various morphological inflection generation datasets, while staying competitive with dedicated models even with very few training examples. It is also computationally appealing, as it enables linear time decoding while staying resolution preserving, i.e. without compressing the input sequence into a single fixed-sized vector. Future work may include applying our model to other nearly-monotonic align-and-transduce tasks like abstractive summarization, transliteration or machine translation.