Single-Model Encoder-Decoder with Explicit Morphological Representation for Reinflection

Morphological reinflection is the task of generating a target form given a source form, a source tag and a target tag. We propose a new way of modeling this task with neural encoder-decoder models. Our approach reduces the amount of required training data for this architecture and achieves state-of-the-art results, making encoder-decoder models applicable to morphological reinflection even for low-resource languages. We further present a new automatic correction method for the outputs based on edit trees.


Introduction
Morphological analysis and generation of previously unseen word forms is a fundamental problem in many areas of natural language processing (NLP). Its accuracy is crucial for the success of downstream tasks like machine translation and question answering. Accordingly, learning morphological inflection patterns from labeled data is an important challenge.
The task of morphological reinflection (MRI) consists of producing an inflected form for a given source form, source tag and target tag. A special case is morphological inflection (MI), the task of finding an inflected form for a given lemma and target tag. An English example is "tree"+PLURAL → "trees". Prior work on MI and MRI includes machine learning models and models that exploit the paradigm structure of the language (Ahlberg et al., 2015; Dreyer, 2011; Nicolai et al., 2015).
In this work, we propose the neural encoder-decoder MED (Morphological Encoder-Decoder), a character-level sequence-to-sequence attention model that is a language-independent solution for MRI. In contrast to prior work, we train a single model on all source-to-target mappings of the language that are attested in the training set. This radically reduces the amount of training data needed for the encoder-decoder because most MRI patterns occur in many source-target tag pairs. In our model design, what is learned for one pair can be transferred to others.
The key enabler for this single-model approach is a novel representation we use for MRI. We encode the input as a single sequence of (i) the morphological tags of the source form, (ii) the morphological tags of the target form and (iii) the sequence of letters of the source form. The output is the sequence of letters of the target form. As the decoder produces each letter, the attention mechanism can focus on the input letter sequence for parts of the output that simply copy the input. For other parts of the output, e.g., an inflectional ending that is predicted using the target tags, the attention mechanism can focus on the target morphological tags. In more complex cases, simultaneous attention can be paid to subsequences of all three input types: source tags, target tags and input letter sequence. We can train a single generic encoder-decoder per language on this representation that can handle all tag pairs, thus making it possible to make efficient use of the available training data. MED outperformed other systems on the SIGMORPHON16 shared task 1 for all ten languages that were covered (Kann and Schütze, 2016; Cotterell et al., 2016).
We also present POET (Prefer Observed Edit Trees), a new generic method for correcting the output of an MRI system. The combination of MED and POET is state-of-the-art or close to it on a CELEX-based evaluation of MRI even though this evaluation makes it difficult to exploit generalizations across tag pairs.

Model Description
Neural network model. Our model is based on an attention-based encoder-decoder network architecture proposed for machine translation. The original work describes the model in detail; unless we explicitly say so in the description of our model below, we use the same network configuration. That model is an extension of the recurrent neural network (RNN) encoder-decoder developed by, among others, Sutskever et al. (2014). The encoder of the latter consists of an RNN that reads an input sequence of vectors $x$ and encodes it into a fixed-length context vector $c$, computing hidden states $h_t$ and $c$ by
$$h_t = f(x_t, h_{t-1}) \qquad (1)$$
$$c = q(h_1, \ldots, h_{T_x}) \qquad (2)$$
with nonlinear functions $f$ and $q$. The decoder is trained to predict each output $y_t$ dependent on $c$ and previous predictions $y_1, \ldots, y_{t-1}$:
$$p(y) = \prod_{t=1}^{T_y} p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c)$$
with $y = (y_1, \ldots, y_{T_y})$ and each conditional probability being modeled with an RNN as
$$p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c) \qquad (3)$$
where $g$ is a nonlinear function and $s_t$ is the hidden state of the RNN.
The architecture we build on is an attention-based extension of this model that allows different vectors $c_t$ for each step by automatic learning of an alignment model. Additionally, the encoder is bidirectional: each hidden state $h_j$ at time step $j$ depends not only on the preceding, but also on the following input:
$$h_j = \left[\overrightarrow{h}_j^{\top}; \overleftarrow{h}_j^{\top}\right]^{\top} \qquad (4)$$
The formula for $p(y)$ changes as follows:
$$p(y_t \mid \{y_1, \ldots, y_{t-1}\}, x) = g(y_{t-1}, s_t, c_t) \qquad (5)$$
with $s_t$ being an RNN hidden state for time $t$ and $c_t$ being the weighted sum of the annotations $(h_1, \ldots, h_{T_x})$ produced by the encoder, using the attention weights. Further descriptions can be found in the original work.
The final model is a multilayer network with a single maxout (Goodfellow et al., 2013) hidden layer that computes the conditional probability of each element in the output sequence (a letter in our case; Pascanu et al., 2014). As MRI is less complex than machine translation, we reduce the number of hidden units and embedding size. After initial experiments, we fixed the hyperparameters of our system and did not further adapt them to a specific task or language. Encoder and decoder RNNs have 100 hidden units each.
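To make the role of the step-specific context vector concrete, the following minimal sketch computes attention weights as a softmax over alignment scores and forms the context vector as the weighted sum of the encoder annotations. The dot-product scoring function, the function names and the toy vectors are assumptions chosen for brevity; the model described above learns its alignment model jointly with the rest of the network.

```python
import math

def softmax(scores):
    """Normalize raw alignment scores into attention weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(annotations, decoder_state):
    """Return (weights, c_t) for one decoding step.

    annotations: encoder annotation vectors h_1..h_Tx
    decoder_state: decoder hidden state (same dimension, by assumption)
    """
    # Assumption: a dot product stands in for the learned alignment model.
    scores = [sum(a * b for a, b in zip(h, decoder_state)) for h in annotations]
    weights = softmax(scores)
    dim = len(annotations[0])
    # c_t is the attention-weighted sum of the annotations.
    c_t = [sum(w * h[d] for w, h in zip(weights, annotations)) for d in range(dim)]
    return weights, c_t

# Toy example with three input positions and two-dimensional annotations.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, c_t = context_vector(annotations, [0.0, 2.0])
```

In this toy setting, positions whose annotation agrees with the decoder state receive more weight, which is the behavior that lets a decoder attend to tags for endings and to letters for copied material.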
For training, we use stochastic gradient descent with Adadelta (Zeiler, 2012) and a minibatch size of 20. Following Le et al. (2015), we initialize all weights in the encoder, decoder and embeddings, except for the GRU weights in the decoder, with the identity matrix, and all biases with zero. We train all models for 20,000 iterations. We settled on this number in early experimentation because training usually converged before that limit.
MED is an ensemble of five RNN encoder-decoders. The final decision is made by majority voting. In case of a tie, the answer is chosen randomly among the most frequent predictions.
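The voting rule can be sketched as follows. This is a minimal illustration, not the original implementation; the function name and the sample predictions are hypothetical.

```python
import random
from collections import Counter

def majority_vote(predictions, rng=random):
    """Pick the most frequent prediction; break ties uniformly at random."""
    counts = Counter(predictions)
    best = max(counts.values())
    tied = sorted(form for form, n in counts.items() if n == best)
    return tied[0] if len(tied) == 1 else rng.choice(tied)

# Hypothetical outputs of five ensemble members for one input.
votes = ["isolierte", "isolierte", "isolierten", "isolierte", "isoliert"]
winner = majority_vote(votes)
```

Passing a seeded `random.Random` instance as `rng` makes the tie-breaking reproducible across runs.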
Input and output format. We define the alphabet $\Sigma_{lang}$ as the set of characters used in the application language. As each morphological tag consists of one or more subtags, e.g. "number" or "case", we further define $\Sigma_{src}$ and $\Sigma_{trg}$ as the set of morphological subtags seen during training as part of the source tag and target tag, respectively. Let $S_{start}$ and $S_{end}$ be predefined start and end symbols. Then each input of our system is of the format $S_{start}\,\Sigma_{src}^{+}\,\Sigma_{trg}^{+}\,\Sigma_{lang}^{+}\,S_{end}$. In the same way, we define the output format as $S_{start}\,\Sigma_{lang}^{+}\,S_{end}$.
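A sample input ends with the sequence OUT=num=PL i s o l i e r t e r </w>, that is, the last target subtag followed by the letters of the source form and the end symbol. The system should produce the corresponding output <w> i s o l i e r t e </w>. The high-level structure of MED can be seen in Figure 1.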
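A minimal sketch of this encoding is given below. The OUT= prefix and the <w> and </w> symbols follow the sample above; the IN= prefix for source subtags, the function names and the concrete tag values are assumptions made for illustration.

```python
def encode_input(source_tag, target_tag, source_form):
    """Build the input: start symbol, source subtags, target subtags, letters, end symbol.

    Tags are dicts mapping a subtag name to its value, e.g. {"num": "PL"}.
    """
    # Assumption: source subtags carry an IN= prefix, mirroring OUT= for target subtags.
    src = [f"IN={name}={value}" for name, value in source_tag.items()]
    trg = [f"OUT={name}={value}" for name, value in target_tag.items()]
    return ["<w>"] + src + trg + list(source_form) + ["</w>"]

def encode_output(target_form):
    """Build the output: start symbol, letters of the target form, end symbol."""
    return ["<w>"] + list(target_form) + ["</w>"]

# Illustrative tag values; only OUT=num=PL is taken from the sample above.
tokens = encode_input({"case": "GEN", "num": "PL"},
                      {"case": "ACC", "num": "PL"},
                      "isolierter")
expected = encode_output("isolierte")
```

Because tags and letters share one sequence, a single vocabulary over subtag symbols and characters suffices for the encoder.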
POET. We now describe POET (Prefer Observed Edit Trees), a new generic method for correcting the output of an MRI system. We use it in combination with MED in this paper, but it can in principle be applied to any MRI system.
Figure 2: Edit tree for the inflected form abgesagt "canceled" and its lemma absagen "to cancel". The highest node contains the length of the parts before and after the LCS. The left node in the second row contains the length of the parts before and after the LCS of abge and ab. The prefix sub indicates that the node is a substitution operation.
An edit tree e(σ, τ ) specifies a transformation from a source string σ to a target string τ (Chrupała, 2008). To compute e(σ, τ ), we first determine the longest common substring (LCS) (Gusfield, 1997) between σ and τ and then recursively model the prefix and suffix pairs of the LCS. If the length of LCS is zero for (σ, τ ), then e(σ, τ ) is simply the substitution operation that replaces σ with τ . Figure 2 shows an example.
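The recursive construction can be sketched as follows. This is an illustrative implementation under assumptions: the tuple encoding of nodes, the function names and the `apply_tree` helper are hypothetical and not taken from the original system.

```python
def longest_common_substring(a, b):
    """Return (start_a, start_b, length) of the longest common substring."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best

def edit_tree(source, target):
    """Recursively build a hashable edit tree transforming source into target."""
    i, j, n = longest_common_substring(source, target)
    if n == 0:
        # No shared material: a substitution leaf replaces source with target.
        return ("sub", source, target)
    left = edit_tree(source[:i], target[:j])
    right = edit_tree(source[i + n:], target[j + n:])
    # Inner node: lengths of the source parts before and after the LCS.
    return ("split", i, len(source) - i - n, left, right)

def apply_tree(tree, word):
    """Apply an edit tree to a word; return None if the tree does not fit."""
    if tree[0] == "sub":
        return tree[2] if word == tree[1] else None
    _, before, after, left, right = tree
    if before + after > len(word):
        return None
    head = apply_tree(left, word[:before])
    tail = apply_tree(right, word[len(word) - after:])
    if head is None or tail is None:
        return None
    return head + word[before:len(word) - after] + tail

tree = edit_tree("abgesagt", "absagen")
```

Because a node stores only lengths and substitutions rather than the shared substring itself, the same tree applies to other words with the same pattern; here it also maps angesagt to ansagen.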
Let X be a training set for MRI. For each pair (s, t) of tags, we define:
$$E_{s,t} = \{\, e(x) \mid x \in X,\ S(x) = s,\ T(x) = t \,\}$$
where S(x) and T(x) are source and target tags of x and e(x) is e(σ(x), τ(x)), the edit tree that transforms the source form into the target form.
Let ρ be a target form predicted by the MRI system for the source form σ and let s and t be source and target tags. POET does not change ρ if $e(\sigma, \rho) \in E_{s,t}$. Otherwise it replaces ρ with a form τ that satisfies
$$e(\sigma, \tau) \in E_{s,t} \quad \text{and} \quad |\rho, \tau| = 1$$
where |ρ, τ| is the Levenshtein distance. If there are several forms τ with edit distance 1, we select the one with the most frequent edit tree. Ties are broken randomly.
We observed that MED sometimes makes errors that are close to the target, but differ by one edit operation. Those errors are often not covered by edit trees that are observed in the training data whereas the correct form is. Thus, substituting a form not supported by an observed edit tree with a close one that is supported promises to reduce the error rate.
The effectiveness of POET depends on a training set that is large enough to cover the possible edit trees that can occur in reinflection in a language. Thus, if the training set is not large enough in this respect, then POET will not be beneficial.
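The correction rule can be sketched as below. This is a minimal illustration under assumptions: the set of supported forms (those τ whose edit tree from σ was observed for the tag pair) is passed in as a dictionary rather than enumerated, the function names and frequencies are hypothetical, and leaving ρ unchanged when no supported form is at distance 1 is an assumption.

```python
import random

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def poet_correct(prediction, supported, rng=random):
    """Apply the POET rule to one predicted form.

    supported: dict mapping every form tau whose edit tree e(sigma, tau) was
    observed for the tag pair (s, t) to that tree's training frequency.
    """
    if prediction in supported:
        return prediction  # the edit tree was observed: keep the prediction
    close = {form: freq for form, freq in supported.items()
             if levenshtein(prediction, form) == 1}
    if not close:
        return prediction  # assumption: nothing close enough, leave unchanged
    top = max(close.values())
    # Prefer the most frequent edit tree; break remaining ties at random.
    return rng.choice(sorted(form for form, freq in close.items() if freq == top))

# Hypothetical frequencies for the source form steuert and tag pair rP -> pA.
supported = {"gesteuert": 120, "steuert": 3}
corrected = poet_correct("gesteuet", supported)
```

A prediction that is already supported passes through untouched, so the procedure can only change outputs that no observed edit tree explains.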

Experiments
We compare MED with the three models of Dreyer et al. (2008) as well as with two recently proposed models: (i) discriminative string transduction (Durrett and DeNero, 2013; Nicolai et al., 2015), the SIGMORPHON16 baseline, and (ii) Faruqui et al. (2015)'s encoder-decoder model. We call the latter MODEL*TAG as it requires training as many models as there are target tags.
We evaluate MED on two MRI tasks: CELEX and SIGMORPHON16.
CELEX. This task is based on complete inflection tables for German extracted from CELEX. For this experiment we follow Dreyer et al. (2008). We use four pairs of morphological tags and corresponding word forms from the German part of the CELEX morphological database. The four different transduction tasks are: 13SIA → 13SKE, 2PIE → 13PKE, 2PKE → z and rP → pA. An example for this task would be to produce the output gesteuert (target tag pA) for the source steuert (source tag rP). To do so, the system has to learn that the prefix ge-, which is used for many participles in German, has to be added to the beginning of the original word form.
We use the same data splits as Dreyer et al. (2008), dividing the original 2500 samples for each tag into five folds, each consisting of 500 training, 1000 development and 1000 test samples. We train a separate model for each fold and report exact match accuracy, averaged over the five folds, as our final result.
SIGMORPHON16. This task covers eight languages and does not provide complete paradigms, but only a set of quadruples, each consisting of word form, source tag, target tag and target form. The main difference to CELEX is that the number of tag pairs is large, resulting in much less training data per tag pair. The number of tag pairs varies by language with Georgian being an extreme case; it has 28 tag pairs in dev that appear fewer than 10 times in train. For each language, we have around 12,800 training and 1600 development samples. We report exact match accuracy on the development set, as the final test data of the shared task is not publicly available yet.

Results
Table 1 gives CELEX results. MED+POET is better than prior work on one task, close in performance on two and worse by a small amount on the remaining one. Unlike Dreyer et al. (2008)'s models, MED does not use any hand-crafted features. MED's results are weakest on 13SIA. Typical errors on this task include epenthesis (e.g., zirkle vs. zirkele) and irregular verbs (e.g., abhing vs. abhängte).
For SIGMORPHON16, Table 2 shows that MED outperforms the baseline for all eight languages. Absolute performance and variance are probably influenced by type of morphology (e.g., templatic vs. agglutinative), regularity of the language, number of different tag pairs and other factors. MED performs well even for complex and diverse languages like Arabic, Finnish, Navajo and Turkish, suggesting that the type of attention-based encoder-decoder we use (single-model, using an explicit morphological representation) is a good choice for MRI. We do not compare to MODEL*TAG here because it requires training a large number of individual networks. This is a disadvantage compared to MED both in terms of the number of models that need to be trained and in terms of the effective use of the small number of training examples that are available per tag pair.
POET improves the results for all tag pairs for CELEX. However, initial experiments indicated that it is not effective for SIGMORPHON16 because its training sets are not large enough.

Analysis
The main innovation of our work is that MED learns a single model of all MRI patterns of a language and thus can transfer what it has learned from one tag pair to another tag pair. Using CELEX, we now analyze how much our design contributes to better performance by conducting two experiments in which we gradually decrease the training set in two different ways. (i) Large general training set. We only reduce the number of training examples available for a tag pair (s, t) and retain all other training examples. (ii) Small training set. We reduce the number of training examples available for all tag pairs, not just for one.
A typical example of the large general training set scenario is that familiar second person forms are rare in genres like encyclopedia and news. So a training set derived from these genres will be large, but it will have very few tag pairs whose target tag is familiar second person.
A typical example of the small training set scenario is that we are dealing with a low-resource language. In the following two experiments, we only reduce the training set and do not change the test set.
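The two reduction schemes can be sketched as follows. The record layout, the function names and the use of seeded uniform random sampling are assumptions for illustration; how examples were selected for removal is not specified here.

```python
import random

def pair_of(example):
    """Tag pair (source tag, target tag) of one training example."""
    return (example["src_tag"], example["trg_tag"])

def reduce_for_pair(examples, pair, fraction, seed=0):
    """Scenario (i): keep a fraction of one tag pair, retain all other examples."""
    rng = random.Random(seed)
    target = [x for x in examples if pair_of(x) == pair]
    others = [x for x in examples if pair_of(x) != pair]
    return others + rng.sample(target, int(len(target) * fraction))

def reduce_all(examples, fraction, seed=0):
    """Scenario (ii): keep the same fraction of the examples of every tag pair."""
    rng = random.Random(seed)
    by_pair = {}
    for x in examples:
        by_pair.setdefault(pair_of(x), []).append(x)
    reduced = []
    for group in by_pair.values():
        reduced.extend(rng.sample(group, int(len(group) * fraction)))
    return reduced

# Toy training set: 16 placeholder examples for each of two tag pairs.
examples = (
    [{"src_tag": "2PIE", "trg_tag": "13PKE", "form": f"a{i}"} for i in range(16)]
    + [{"src_tag": "rP", "trg_tag": "pA", "form": f"b{i}"} for i in range(16)]
)
large_general = reduce_for_pair(examples, ("2PIE", "13PKE"), 0.0625)
small = reduce_all(examples, 0.5)
```

In both cases only the training examples are subsampled, so results on the unchanged test set remain comparable across reduction levels.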
Large general training set. We iteratively halve the training data for 2PIE → 13PKE until only 6.25% or 32 samples are left. Figure 3 shows that MED performs well even if only 6.25% of the training examples for the tag pair remain. In contrast, MODEL*TAG struggles to generalize correctly. MED's robustness is due to the fact that we train one single model for all tags, so it can learn from other tags and transfer what it has learned to the tag pair that has a small training set.
Small training set. Figure 4 shows results for reducing the training data equally for all tags. MED performs much better than the baseline when less than 50% of the training data is used. This can be explained by the fact that MED learns from all given data at once and thus is able to learn common patterns that apply across different tag pairs.

Related Work
In recent years, RNN encoder-decoder models and RNNs in general have been applied to several NLP tasks. For example, they have proved useful for machine translation (Sutskever et al., 2014), parsing (Vinyals et al., 2015) and speech recognition (Graves and Schmidhuber, 2005; Graves et al., 2013).
MED bears some resemblance to Faruqui et al. (2015)'s work. However, they train one network for every tag pair; this can negatively impact performance for low-resource languages and in general when training data are limited. In contrast, we train a single model for each language. This radically reduces the amount of training data needed for the encoder-decoder because most MRI patterns occur in many tag pairs, so what is learned for one can be transferred to others. To be able to model all tag pairs of the language together, we introduce an explicit morphological representation that enables the attention mechanism of the encoder-decoder to generalize MRI patterns across tag pairs.

Conclusion and Future Work
We have presented MED, a language-independent neural sequence-to-sequence mapping approach, and POET, a method based on edit trees for correcting the output of an MRI system. MED obtains results comparable to state-of-the-art systems for CELEX and establishes the state of the art for SIGMORPHON16. POET improves results further for large training sets. Our analysis showed that MED outperforms a neural encoder-decoder baseline system by a large margin, especially for small training sets.
In future work, we would first like to make POET less dependent on the source tag and thus increase its accuracy for small training sets. Second, we will look into ways of taking advantage of additional information sources including unlabeled corpora.