Align and Copy: UZH at SIGMORPHON 2017 Shared Task for Morphological Reinflection

This paper presents the submissions by the University of Zurich to the SIGMORPHON 2017 shared task on morphological reinflection. The task is to predict the inflected form given a lemma and a set of morpho-syntactic features. We focus on neural network approaches that can tackle the task in a limited-resource setting. As the transduction of the lemma into the inflected form is dominated by copying over lemma characters, we propose two recurrent neural network architectures with hard monotonic attention that are strong at copying and, yet, substantially different in how they achieve this. The first approach is an encoder-decoder model with a copy mechanism. The second approach is a neural state-transition system over a set of explicit edit actions, including a designated COPY action. We experiment with character alignment and find that naive, greedy alignment consistently produces strong results for some languages. Our best system combination is the overall winner of the SIGMORPHON 2017 Shared Task 1 without external resources. At a setting with 100 training samples, both our approaches, as ensembles of models, outperform the next best competitor.


Introduction
This paper describes our approaches and results for Task 1 (without external resources) of the CoNLL-SIGMORPHON 2017 challenge on Universal Morphological Reinflection (Cotterell et al., 2017). This task consists in generating inflected word forms for 52 languages given a lemma and a morphological feature specification (Sylak-Glassman et al., 2015). There are three task setups: a low setting where training data are only 100 (!) samples, a medium setting with 1K training samples, and a high setting with 10K samples. We consider the problem of tackling morphological inflection generation in a low-resource setting with a neural network approach, which is hard for plain soft-attention encoder-decoder models (Kann and Schütze, 2016a,b). We present two systems that are based on the hard monotonic attention model of Aharoni and Goldberg (2017); Aharoni et al. (2016), which is strong on smaller-sized training datasets. We observe that to excel in a low-resource setting, a model needs to be good at copying lemma characters over to the inflected form, by far the most common operation of string transduction in the morphological inflection generation task.

* These two authors contributed equally.
In our first approach, we extend the hard monotonic attention model with a copy mechanism that produces a mixture distribution from the character generation and character copying distributions. This idea is reminiscent of the pointer-generator model of See et al. (2017) and the CopyNet model of Gu et al. (2016).
Our second approach is a neural state-transition system that explicitly learns the copy action and thus does away with character decoding altogether whenever a character needs to be copied over. This approach is inspired by shift-reduce parsing with stack LSTMs (Dyer et al., 2015) and transition-based named entity recognition (Lample et al., 2016).

Preliminaries
In this section, we formally describe the problem of morphological inflection generation as a string transduction task. Next, we show how this task can be reformulated in terms of transduction actions. Finally, we discuss the string alignment strategies that we use to derive oracle actions.

Morphological inflection generation
Morphological inflection generation is an instance of the more general sequence transduction task, where the goal is to find a mapping of a variable-length sequence x to another variable-length sequence y. Specific to morphological inflection generation is that the input and output vocabularies (lemmas and inflected forms) are the same set of characters of one natural language, i.e. Σ_x = Σ_y = Σ. Formally, our task is to learn a mapping from an input sequence of characters x_1:n ∈ Σ* (the lemma) to an output sequence of characters y_1:m ∈ Σ* (the inflected form) given a set of morpho-syntactic features f ⊆ Φ, where Φ is the alphabet of morpho-syntactic features for that language.

Task reformulation
To efficiently condition on parts of the input sequence, we use hard monotonic attention, which has been found highly suitable for this task (Aharoni and Goldberg, 2017; Aharoni et al., 2016). With hard attention, at each step, the prediction of an output element is based on attending to only one element from the input sequence, as opposed to conditioning on the entire input sequence as in soft attention models.
Hard monotonic attention is motivated by the often monotonic alignment between the lemma characters and the characters of its inflected form: it suffices to only allow the attention pointer to advance sequentially over the elements of the input sequence. Thus, the sequence transduction process can be represented as a sequence of actions a_1:q ∈ Ω* over an input string, where the set of actions Ω includes operations for writing characters and advancing the attention pointer.

[Figure 2: Examples of smart alignment (top) and naive alignment (bottom) for "flog"/"fliegen". In each example, the inflected form is at the top, the lemma at the bottom.]

We can, therefore, reformulate the task as finding a mapping from an input sequence of lemma characters x ∈ Σ* to an output sequence of actions â ∈ Ω*, given a set of morpho-syntactic features f ⊆ Φ, such that:

â = argmax_{a ∈ Ω*} P(a | x, f)    (1)

We use a recurrent neural network to estimate the probability distribution P in Equation 1 from training data. To derive the sequence of oracle actions from each training sample, we use two different character alignment strategies, formally described below.
Smart alignment uses the Chinese Restaurant Process character alignment implementation distributed with the SIGMORPHON 2016 baseline system; this is the aligner of Aharoni and Goldberg (2017).
Naive alignment aligns two sequences p and q, such that the length of p is greater than or equal to the length of q, by producing 1-to-1 character alignments until it reaches the end of q, from which point it outputs 1-to-0 alignments (and 0-to-1 alignments once it reaches the end of p if |q| > |p|).


First approach: Hard attention model with copy mechanism (HACM)
Our first approach extends the hard monotonic attention model of Aharoni and Goldberg (2017) with a copy mechanism, which adds a soft switch between generating an output symbol from a fixed vocabulary Σ_train and copying the currently attended input symbol x_i. In this section, we first review the architecture of the hard monotonic model and then present our copy mechanism.

Hard monotonic attention model
The hard monotonic attention model operates over two types of actions: WRITE σ, σ ∈ Σ, for outputting the character σ and STEP which moves forward the attention pointer, i.e. Ω = Σ ∪ {STEP}. At each step, the model either generates an output symbol or starts to attend to the next encoded input character. The system learns to move the attention pointer by outputting a STEP action.
To compute the sequences of oracle actions for each training pair of a lemma and its inflected form, Aharoni and Goldberg (2017) apply a deterministic algorithm to the output of the smart aligner.
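As an illustration, the naive aligner of Section 2.3 and a much-simplified oracle derivation over the WRITE/STEP action set can be sketched in Python as follows. The function names are ours, and the actual deterministic algorithm operates on the smart aligner's many-to-many output; this sketch only covers the naive 1-to-1 case:

```python
def naive_align(p, q):
    """Naive alignment sketch: pair characters 1-to-1 position by
    position; surplus characters of the longer string are aligned to
    nothing ("" marks the empty side of a 1-to-0 or 0-to-1 pair)."""
    pairs = []
    for i in range(max(len(p), len(q))):
        a = p[i] if i < len(p) else ""
        b = q[i] if i < len(q) else ""
        pairs.append((a, b))
    return pairs

def oracle_actions(aligned_pairs):
    """Map (lemma_char, out_char) alignment pairs to WRITE/STEP oracle
    actions: write the aligned output character while attending to the
    lemma character, then STEP past it (1-to-0 pairs yield only STEP,
    0-to-1 pairs only WRITE)."""
    actions = []
    for lemma_char, out_char in aligned_pairs:
        if out_char:
            actions.append(("WRITE", out_char))
        if lemma_char:
            actions.append("STEP")
    return actions

# naive_align("fliegen", "flog") pairs lemma and inflected form
# position by position; the derived oracle is:
# [('WRITE','f'), 'STEP', ('WRITE','l'), 'STEP', ('WRITE','o'), 'STEP',
#  ('WRITE','g'), 'STEP', 'STEP', 'STEP', 'STEP']
```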
Architecture The hard monotonic attention model uses a single-layer bidirectional LSTM encoder (Graves and Schmidhuber, 2005) to encode the input lemma x_1:n as a sequence of vectors h_1:n, h_i ∈ R^2H, where H is the hidden dimension of the LSTM layer. At all time steps t, the model maintains a state s_t ∈ R^H from which the most probable action a_t is predicted. The sequence of states is modeled with a single-layer LSTM that receives, at time t, a concatenated input of:

1. the currently attended vector h_i ∈ R^2H, where i is the attention pointer,
2. the concatenated vector of feature embeddings f ∈ R^(F·|Φ|), where F is the dimension of the feature embedding layer,
3. the embedding of the previous output action E(a_t−1) ∈ R^E, where E is the dimension of the action embedding layer:

s_t = LSTM([h_i; f; E(a_t−1)])    (2)
Let Σ_train ⊆ Σ be the set of characters in the training data. Then, the distribution P^gen_t for generating actions over the vocabulary Ω_train = Σ_train ∪ {STEP} is modeled with the softmax function:

P^gen_t = softmax(W · s_t + b)    (3)

When the predicted action is STEP, the attention index is incremented, i := i + 1, so at the next time step t + 1 the model attends to vector h_i+1 of the bidirectionally encoded lemma sequence.

Copy mechanism
Our copying mechanism is based on using a mixture of the generation probability distribution from Equation 3 and a copying probability distribution.

[Table 1: Examples of generating German "flog" from "fliegen": HACM (left), HAEM (right). i is the attention pointer, x_i the currently attended lemma character, a the sequence of actions, y the output, t the index over actions.]
At each time step t, the action a ∈ Ω_train is predicted from the following mixture distribution:

P_t(a) = w^gen_t · P^gen_t(a) + (1 − w^gen_t) · 1_{a = x_i}    (4)

The mix-in parameter of the generation distribution, w^gen_t ∈ [0, 1], is calculated from the concatenation of the state vector s_t and the input vector that produces this state. The resulting vector is fed through a linear layer to the logistic sigmoid function:

w^gen_t = σ(w · [s_t; h_i; f; E(a_t−1)] + b)    (5)

The mix-in parameter serves as a switch between a) generating a character from Σ_train according to the generation distribution P^gen_t, and b) copying the currently attended character x_i ∈ Σ_train.
At test time, we allow the copying of out-of-vocabulary (OOV) symbols by adding the following modification to the mixture distribution in Equation 4:

P_t(a) = 1_{a = x_i}                                     if x_i ∉ Σ_train,
P_t(a) = w^gen_t · P^gen_t(a) + (1 − w^gen_t) · 1_{a = x_i}   otherwise    (6)

Therefore, if the currently attended symbol x_i is OOV, we copy it with probability one according to the distribution 1_{a = x_i}; otherwise, we use the mixture of the generation distribution P^gen_t and the copy distribution 1_{a = x_i}. Thus, the distribution P_t is built over an instance-specific vocabulary Ω_train ∪ {x_i}. After copying the OOV symbol, we advance the attention pointer and use STEP as the previous predicted action.
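A minimal numpy sketch of this mixture with the test-time OOV modification; the function signature and names are our own, and the generation distribution and mix-in weight are taken as given:

```python
import numpy as np

def mixture(p_gen, vocab, x_i, w_gen):
    """Sketch of the HACM mixture distribution with the test-time OOV
    modification: if the attended character x_i is not in the
    vocabulary, it is copied with probability one; otherwise the
    generation and copy distributions are mixed with weight w_gen.
    Returns (probabilities, possibly extended vocabulary)."""
    if x_i not in vocab:                       # OOV: copy with prob. 1
        return np.append(np.zeros(len(vocab)), 1.0), vocab + [x_i]
    copy = np.array([1.0 if a == x_i else 0.0 for a in vocab])
    return w_gen * np.asarray(p_gen) + (1.0 - w_gen) * copy, vocab
```

With vocab = ['a', 'b', 'STEP'], a uniform p_gen, and w_gen = 0.5, attending to 'a' yields P('a') = 0.5·(1/3) + 0.5 = 2/3, and the distribution still sums to one.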
The full architecture of the HACM system is shown schematically in Figure 3.

Learning
We train the system using the cross-entropy loss, which, for a single input (x, y, f), equates to:

L(x, y, f; Θ) = − Σ_t log P_t(a_t | a_1:t−1, x, f; Θ)    (7)

where x, y are the lemma and inflected form character sequences, f the set of morpho-syntactic features, a the sequence of oracle actions derived from (x, y), Θ the model parameters, and P_t the probability distribution over actions from Equation 4.
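The loss for one sample can be sketched as follows, assuming (as an illustrative simplification of ours) that the per-step distributions are available as dictionaries mapping actions to probabilities:

```python
import math

def sequence_loss(step_dists, oracle):
    """Cross-entropy of one oracle action sequence (sketch): the sum of
    negative log-probabilities that the per-step distributions P_t
    assign to the oracle actions a_t."""
    return -sum(math.log(p_t[a_t]) for p_t, a_t in zip(step_dists, oracle))

# Two steps that each give the oracle action probability 0.5 incur a
# loss of 2 * log 2 ≈ 1.386.
```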

Second approach: Hard attention model over edit actions (HAEM)
This neural state-transition system also uses hard monotonic attention but transduces the lemma into the inflected form by a sequence of explicit edit actions: COPY, DELETE, and WRITE σ, σ ∈ Σ. The architectures of the two models are also different (Figure 3).

Semantics of edit actions
COPY If the system generates COPY, the lemma character at the attention index x i is appended to the current output of the inflected form and the attention index is incremented i := i + 1. Therefore, unlike other neural morphological inflection generation systems, the copy character is not decoded from the neural network.
DELETE The system generates DELETE if it needs to increment the attention index.
WRITE σ Whenever the system chooses to append a character σ ∈ Σ to the current output of the inflected form, such that σ ≠ x_i, where x_i is the lemma character at the attention index, it generates the corresponding WRITE σ action.
Using this set of edit actions, the system can copy, delete, and substitute new characters. The substitution of a new character σ for a currently attended lemma character x_i, σ ≠ x_i, is expressed as a sequence of one DELETE and one WRITE σ action.
This action set directly compares to the Ω = Σ ∪ {STEP} actions of the HACM model, which uses the most basic actions to express edit operations. Crucially, in the HAEM system, character copying is a single action (which does not require character decoding), whereas in HACM it is typically a sequence of one WRITE σ (σ = x_i) action and one STEP action. 3 Further, HAEM effectively deals with OOV characters through its COPY and DELETE actions.
STOP Additionally, to signal the end of transduction, the system generates a STOP action.
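The semantics above can be made concrete with a small interpreter (a sketch, not the actual system; the action encoding is our own):

```python
def apply_edit_actions(lemma, actions):
    """Interpreter for the HAEM edit actions (sketch): COPY appends the
    attended lemma character and advances the pointer, DELETE only
    advances the pointer, ('WRITE', c) appends c without moving the
    pointer, and STOP ends the transduction."""
    output, i = [], 0                # i is the 0-based attention index
    for action in actions:
        if action == "COPY":
            output.append(lemma[i])
            i += 1
        elif action == "DELETE":
            i += 1
        elif action == "STOP":
            break
        else:                        # ('WRITE', c)
            output.append(action[1])
    return "".join(output)

# German "fliegen" → "flog": copy f, l; substitute i → o (DELETE +
# WRITE); delete e, copy g; delete the remaining e, n.
# apply_edit_actions("fliegen", ["COPY", "COPY", "DELETE",
#     ("WRITE", "o"), "DELETE", "COPY", "DELETE", "DELETE", "STOP"])
# → "flog"
```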

Deriving oracle actions
We use the character alignment methods of Section 2.3 to deterministically compute sequences of oracle actions for each training example using Algorithm 1. We then normalize all sub-sequences of only DELETE and WRITE σ in such a way that all DELETEs come before all WRITE σ actions. This simplifies unintuitive alignments produced by the smart aligner, especially at the low setting.
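The normalization step can be sketched as follows (our own helper, treating COPY and STOP as run boundaries):

```python
def normalize_actions(actions):
    """Sketch of the oracle normalization: within every maximal run of
    DELETE and ('WRITE', c) actions, reorder so that all DELETEs
    precede all WRITEs; any other action closes the current run."""
    normalized, run = [], []

    def flush():
        normalized.extend(a for a in run if a == "DELETE")
        normalized.extend(a for a in run if a != "DELETE")
        run.clear()

    for a in actions:
        if a == "DELETE" or (isinstance(a, tuple) and a[0] == "WRITE"):
            run.append(a)
        else:
            flush()
            normalized.append(a)
    flush()
    return normalized

# normalize_actions([("WRITE", "o"), "DELETE", "COPY"])
# → ["DELETE", ("WRITE", "o"), "COPY"]
```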
3 Except whenever the next alignment is 0-to-1, in which case HACM does not generate STEP. The HAEM system, however, increments the attention index on every COPY action.

Architecture
Similarly to HACM, the input lemma is encoded as a sequence of vectors h_1:n, h_i ∈ R^2H, with a single-layer bidirectional LSTM. Additionally, we use a single-layer LSTM to represent the predicted inflected form y_1:m, to which we refer as LSTM_y. In case the model outputs a character with WRITE σ or COPY, LSTM_y gets updated with the embedding of this character. At all time steps t, the system maintains a state s_t ∈ R^H from which it predicts the most probable action a_t. The state sequence is derived differently from HACM. At time t, a concatenation of:

1. the currently attended vector h_i ∈ R^2H,
2. the set-of-features vector f ∈ R^|Φ|,
3. the output of the latest state y ∈ R^H of the inflected form representation LSTM_y,

passes through a rectified linear unit (ReLU) layer (Glorot et al., 2011) to finally produce the state vector s_t. The probability distribution over valid actions is then computed with a softmax:

P_t = softmax(W · s_t + b)

This describes the basic form of the HAEM system (Figure 3). In our experiments, we extend it to include two more representations: an LSTM that represents the action history, LSTM_a, and another LSTM that encodes a sequence of deleted lemma characters, LSTM_d. The deletion LSTM_d gets emptied once a WRITE σ action is generated. In this way, we attempt to keep in memory a full representation of some sub-sequence of the lemma that needs to be replaced in the inflected form. In the extended system, the state s_t is thus derived from an input vector [y; h_i; f; a; d], where a ∈ R^H is the output of the latest state of the action history LSTM_a and d ∈ R^H the output of the latest state of the deletion LSTM_d.
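The basic state computation can be sketched in numpy as follows. The parameter names W, b, W_out, b_out are our own, and the LSTMs producing y and h_i are omitted; this is a sketch of one decoding step, not the trained system:

```python
import numpy as np

def haem_step(y, h_i, f, W, b, W_out, b_out):
    """One decoding step of the basic HAEM model (sketch). The
    concatenated input [y; h_i; f] passes through a ReLU layer to give
    the state s_t, from which a softmax yields the distribution over
    actions."""
    z = np.concatenate([y, h_i, f])
    s_t = np.maximum(0.0, W @ z + b)        # ReLU state layer
    logits = W_out @ s_t + b_out
    exp = np.exp(logits - logits.max())     # numerically stable softmax
    return exp / exp.sum()
```

With zero-initialized toy parameters, the step returns a uniform distribution over the action set, which makes the shapes easy to check.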
The system is trained using the cross-entropy loss function as in Equation 7.

Experimental setup
We submit seven runs: a) two runs (1 and 2) for the HACM model; b) two runs (3 and 4) for the HAEM model; and c) three runs (5, 6, and 7) that combine both systems. Detailed information on training regimes and the choice of hyperparameter values (e.g. layer dimensions, the application of dropout, etc.) for all the runs is provided in the Appendix. Crucially, for both systems and all settings and languages, we train models with both smart and naive alignments of Section 2.3.

[Table 2: Number of single models trained for each system (HACM, HAEM), setting (low, medium, high), and alignment (S = smart, N = naive).]

Table 2 shows the number of single models for each system, setting, and alignment. We decode using greedy search. We apply a simple post-processing filter that replaces any inflected form containing an endlessly repeating character with the lemma. This affects a small number of test samples (57 for HACM and 238 for HAEM across all languages and alignment regimes), primarily at the low setting.
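The post-processing filter can be sketched as follows. The repeat-length threshold is our assumption; the paper only specifies that endlessly repeating characters trigger the fallback:

```python
def postprocess(prediction, lemma, max_run=4):
    """Sketch of the post-processing filter: if the predicted form
    contains an implausibly long run of one repeated character, fall
    back to the lemma. max_run is a hypothetical threshold."""
    run, prev = 0, None
    for ch in prediction:
        run = run + 1 if ch == prev else 1
        if run > max_run:
            return lemma
        prev = ch
    return prediction

# postprocess("flooooog", "fliegen") → "fliegen"
# postprocess("flog", "fliegen")     → "flog"
```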
All runs aggregate the results of multiple single models, and we use a number of aggregation strategies. For system runs 1 through 4, these are:

MAX strategy: For each language l, we compute two ensembles over single models, one ensemble E(S) over smart alignment models and one ensemble E(N) over naive alignment models. We then pick the ensemble with the higher development set accuracy for l.

ENSEMBLE n strategy: For each language l, we pick at most n models from all single models such that they have the best development set accuracies for l. We then compute one ensemble over them.

Runs 5, 6, and 7 are built with aggregation strategies that use the MAX and ENSEMBLE n strategies as building blocks. Table 3 shows the strategies employed in each run.
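The model-selection part of the two strategies can be sketched as follows (function and argument names are ours; how the selected models are then combined into an ensemble is not shown here):

```python
def max_strategy(ens_acc):
    """MAX strategy sketch: per language, keep the alignment-specific
    ensemble (S = smart, N = naive) with the higher dev-set accuracy.
    ens_acc maps language → {"S": accuracy, "N": accuracy}."""
    return {lang: max(accs, key=accs.get) for lang, accs in ens_acc.items()}

def ensemble_n(model_acc, n):
    """ENSEMBLE_n sketch: pick the (at most) n single models with the
    best dev-set accuracy for one language, to be ensembled."""
    return sorted(model_acc, key=model_acc.get, reverse=True)[:n]

# max_strategy({"de": {"S": 0.81, "N": 0.84}}) → {"de": "N"}
# ensemble_n({"m1": 0.7, "m2": 0.9, "m3": 0.8}, 2) → ["m2", "m3"]
```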
At the high setting, Runs 5, 6, and 7 additionally feature a single run produced with Nematus (Sennrich et al., 2017), a soft-attention encoder-decoder system for machine translation. In all these runs, the Nematus run complements the HAEM models, which perform much worse at the high setting on average. We refer the reader to the Appendix for further information on data preprocessing, hyperparameter values, and training for the Nematus run.

Table 5 gives an overview of the average (macro) performance for each run on the official development and test sets at all settings. Accuracy measures the percentage of word forms that are inflected correctly (without a single character error). For the best system combination, we also report the average Levenshtein distance between the gold standard word form and the system prediction, which represents a softer criterion for correctness. Also, we include the performance of the shared task baseline system, which is a rule-based model that extracts prefix-changing and suffix-changing rules using alignments of each training sample computed with Levenshtein distance and associates the rules with the features of the sample. All our official runs beat the baseline by a large margin on average in terms of accuracy and also in terms of Levenshtein distance. For all settings, we see an improvement from applying the more complex ensembling strategies (Table 3). It is largest for the low and smallest for the high setting.

At the low setting, HAEM outperforms HACM on average by 2-3 percentage points accuracy and is, therefore, especially suited for a low-resource situation. At the medium setting, the performance of HACM is slightly better using smart alignments. The HAEM system does not seem to learn well with naive alignment for this amount of data. The poorer performance of HAEM whenever more training data are available is particularly obvious at the high-resource setting, where the difference between HACM and HAEM is quite large.

Results and Discussion
At the low setting, both the HACM and HAEM ensembles (Run 2 and Run 4) outperform the next best competitor (LMU-02-0 with 46.59%) by 0.23 and 1.94 percentage points in average accuracy. The margin between Run 7 and the next best system is an impressive 4.02 percentage points.
At the medium setting, our best Run 7 also outperforms the next best competitor (LMU-02-0 with 82.64%) by a small margin of 0.16 percentage points. At the high setting, our best Run 7 loses against UE-LMU-01-0 by a small margin of 0.20 percentage points.
The performance of our best system varies strongly across languages (Figure 4). This is not only due to typological differences, but probably also because some languages have only inflection patterns for a single part-of-speech category (e.g. verbs in English) and other languages include nouns and adjectives (sometimes with very imbalanced class distributions). Naive alignment generally works slightly better than smart alignment at the low setting (but sometimes fails detrimentally as in the case of Khaling, Navajo, or Sorani). For the medium and high settings, smart alignment strongly outperforms naive alignment for HAEM, and a bit less so for HACM. For a few languages such as Turkish, Haida or Norwegian-Nynorsk, naive alignment is consistently better than smart alignment.
As future work, we will experiment more with the HAEM model and try to improve its capabilities for high-resource settings. One obvious option would be to use more fine-grained actions, for instance, directly learn substitutions for certain characters. This system would probably also profit from more consistent alignments. Even with smart alignments, we observe linguistically inconsistent character alignments that might also prevent useful generalizations.

Related work
Some task-specific work has been published after the 2016 edition of the SIGMORPHON Reinflection Shared Task, which dealt with 10 languages, providing training material roughly at the size of the high setting of the 2017 task edition (a mean training data set size of 12,751 samples with a standard deviation of 3,303). The winning system of 2016 (Kann and Schütze, 2016a) showed that a standard sequence-to-sequence encoder-decoder architecture with soft attention (Bahdanau et al., 2014), familiar from neural machine translation, outperforms a number of other methods (as far as they were present in the task). Recently, Aharoni and Goldberg (2017) showed that hard monotonic attention works well when training data are scarce. Their approach exploits the almost monotonic alignment between the lemma and its inflected form. The HACM model extends this work with a copying mechanism similar to the pointer-generator model of See et al. (2017) and CopyNet of Gu et al. (2016). In those models, the copying distribution, which is then mixed with the generation distribution, is different: See et al. (2017) employ the soft-attention distribution, whereas Gu et al. (2016) use a separately learned distribution. Our HACM model uses a simpler copying distribution that puts all the probability mass on the currently attended character. The logic of the HAEM model is similar to that of SIGMORPHON 2016's baseline, which uses a linear classifier over hand-crafted features to predict edit actions. Grefenstette et al. (2015) extend an encoder-decoder model with neural data structures to better handle natural language transduction. Rastogi et al. (2016) present a neural finite-state approach to string transduction.

Conclusion
In this large-scale evaluation of morphological inflection generation, we show that a novel neural transition-based approach can deal well with an extreme low-resource setup. For a medium-size training set of 1K items, HACM works slightly better. With abundant data (10K items), encoder-decoder architectures with soft attention are very strong; however, HACM achieves a comparable development set performance.
For optimal results, the ensembling of different system runs is important. We experiment with different ensembling strategies for eliminating bad candidates. At the low setting (100 samples), our best system combination achieves an average test set accuracy of 50.61% (an average Levenshtein distance (LD) of 1.29), at the medium setting (1K samples) 82.8% (LD 0.34), and at the high setting (10K samples) 95.12% (LD 0.11).