Attention-free encoder decoder for morphological processing

We present RACAI's entry for the CoNLL-SIGMORPHON 2018 shared task on universal morphological reinflection. The system is based on an attention-free encoder-decoder neural architecture with a bidirectional LSTM for encoding the input sequence and a unidirectional LSTM for decoding and producing the output. Instead of directly applying a sequence-to-sequence model at character level, we use a dynamic programming algorithm to align the input and output sequences. Based on these alignments, we produce a series of special symbols similar to those of a finite-state transducer (FST).


Introduction
Languages with rich morphology convey morphological attributes such as gender, case, number and obliqueness through character/grapheme variations applied to the dictionary form of the word (lemma). These variations are often obtained by suffixing the word rather than by altering arbitrary characters, but this does not hold for all languages or for irregular word forms. Still, the variations inside the lemma are usually small, requiring the system to replace an average of just 2.3 letters for irregular word forms.
In our approach we exploit this property and employ an encoder-decoder sequence-to-sequence model that doesn't require an attention mechanism. This mitigates attention issues such as repeating or skipping character sequences and reduces the need for models with high representational capacity.
We exploit the property that alignments between the input and output character sequences are monotonic: for example, the word form "men" and the lemma "man" share two letters (alignments) in the same order, without inversions. The standard attention mechanism is well-suited for machine learning tasks; however, with monotonic alignments it sometimes fails to achieve satisfactory results. In most cases this happens because repeated characters or character sequences in the input confuse the attention mechanism, making it generate loops or skip characters.
Several methods have been proposed to address this issue within attention mechanisms, such as guided attention (Tachibana et al., 2017), location-sensitive attention (Chorowski et al., 2015) and other variations. Still, given the particularities of morphological reinflection, we argue that there is no need for an explicit attention mechanism. Instead, we train the decoder to focus on a single input symbol at each time-step and to "self-attend" by shifting the input cursor by one position at a time. This method, though developed independently, closely resembles that of Makarov et al. (2017).
In our previous experiments we used this architecture to perform lemmatization (the opposite task of morphological reinflection) and we obtained state-of-the-art results.
In what follows, we present the attention-free encoder-decoder architecture (Section 2), show our experimental results (Section 3) and finally draw conclusions (Section 4).

Attention-free encoder-decoder
The architecture of our neural network is fairly simple. We use an encoder that "sees" the sequence in both directions and a decoder which is conditioned to produce the output sequence using focused encoder states (see below for details) concatenated with trainable embeddings computed on morphological attributes.
As mentioned before, the classical attention mechanism is not well suited for tasks where the alignments between the input and output sequences are monotonic. Connectionist temporal classification (CTC) (Graves et al., 2006) provides better results in these cases. However, CTC requires the number of input time-steps to be much higher than the number of output labels. This renders CTC unsuitable for morphological reinflection, as the number of output labels is almost always greater than the number of input characters.
Instead, we propose a simpler algorithm that reduces the model complexity and computational load. Our method requires preexisting alignments between the input and output sequences. These alignments are easy to obtain by exploiting a basic property of morphological reinflection, which also holds for lemmatization: regardless of the language and irregularity of the word-form, the lemma and the inflected word form share many symbols.
This implies a high likelihood of aligning identical input and output symbols and does not require Expectation Maximization (EM) for computing alignment probabilities. With this in mind, we propose an algorithm that:
1. Computes an alignment matrix using dynamic programming;
2. Reads the two sequences in reverse and uses the previously computed matrix, favoring diagonal alignments over other alignments;
3. Generates alignment pairs whenever the input and output symbols are identical.
Figure 1 describes our approach step by step. The algorithm is a slightly modified dynamic programming algorithm in the sense that (a) it favors diagonal alignments (to cope with repeating consecutive letters) and (b) it only considers an alignment pair (i, j) if the characters from the source (s) and destination (d) at the two indexes are identical (i.e. s_i = d_j).
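The three steps above can be sketched in Python as follows. This is only an illustrative sketch: the exact recurrence used in Figure 1 is not reproduced here, so we assume a longest-common-subsequence style score matrix, and the function name `align` is hypothetical. The backward pass prefers the diagonal move and only emits a pair when the two characters are identical.

```python
def align(s, d):
    """Return monotonic (i, j) pairs with s[i] == d[j].

    Assumption: the alignment matrix stores longest-common-subsequence
    lengths; the original system may use a different scoring scheme.
    """
    n, m = len(s), len(d)
    # Step 1: fill the (n + 1) x (m + 1) matrix with dynamic programming.
    a = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == d[j - 1]:
                a[i][j] = a[i - 1][j - 1] + 1
            else:
                a[i][j] = max(a[i - 1][j], a[i][j - 1])

    # Step 2: read both sequences in reverse, favoring diagonal moves.
    alignments = []
    i, j = n, m
    while i > 0 and j > 0:
        if s[i - 1] == d[j - 1]:
            # Step 3: identical symbols produce an alignment pair.
            alignments.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif a[i - 1][j - 1] == a[i][j]:
            # Diagonal move over non-identical symbols (no pair emitted).
            i, j = i - 1, j - 1
        elif a[i - 1][j] >= a[i][j - 1]:
            i -= 1
        else:
            j -= 1
    alignments.reverse()
    return alignments
```

Under these assumptions, aligning the lemma "man" with the word form "men" yields the pairs (0, 0) and (2, 2), i.e. the two shared letters in the same order.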
[Figure 1: Pseudocode of the alignment algorithm. Recoverable fragments: s is the input sequence of size n; d is the output sequence of size m; a <- zeros(n+1, m+1); ...; alignments <- alignments + (i-1, j-1); return alignments]
Next, we use the produced alignments to generate the training data for our attention-free encoder-decoder model. For our algorithm to work, we need the decoder to keep track of the focused-on character in the input sequence. This is achieved by simulating an FST using neural networks. Given the input sequence s, the decoder must produce an output sequence d which is composed of three specialized labels and arbitrary characters in the vocabulary. The output symbols are:
• Special symbol COPY: the character at the current focus-index must be "copied" in order to compose the final sequence;
• Special symbol INC: the focus-index is shifted by one position in the input sequence;
• Special symbol EOS: the output sequence is complete and the algorithm stops;
• Any arbitrary character in the vocabulary: the final sequence must be obtained by adding this character.
At runtime we start by setting the focus-index to 0 and the final sequence to the empty string ("") and we follow the instructions of the decoder output in order to construct the final sequence. During training, it is important to do sanity checks on the current focus-index to avoid index out-of-bounds exceptions during the first training epochs, when the model has not yet converged. Once the loss is small enough, we found that the model rarely generates these exceptions. However, it is still recommended to keep these checks in place.
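A minimal sketch of this runtime procedure is given below. The symbol constants and the function name `apply_symbols` are hypothetical, and INC denotes the symbol that advances the focus-index by one position. How an out-of-bounds COPY should be handled is not specified in the text; here we assume it is simply skipped.

```python
# Hypothetical string constants standing in for the special symbols.
COPY, INC, EOS = "<COPY>", "<INC>", "<EOS>"


def apply_symbols(src, symbols):
    """Build the final sequence from the decoder's output symbols."""
    focus = 0   # focus-index starts at the first input character
    out = []    # final sequence starts as the empty string
    for sym in symbols:
        if sym == EOS:
            break                      # output is complete
        if sym == INC:
            focus += 1                 # shift the input cursor by one
        elif sym == COPY:
            # Sanity check: an unconverged model may point past the input.
            # Assumption: such a COPY is ignored rather than raising.
            if 0 <= focus < len(src):
                out.append(src[focus])
        else:
            out.append(sym)            # plain vocabulary character
    return "".join(out)
```

For example, with the input "ab", the symbols COPY, INC, COPY, "s", EOS produce "abs", while a COPY issued after the focus-index has moved past the end of the input adds nothing.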
To obtain the output sequence d on which we train our network we use a fixed-oracle algorithm that is summarized as:
1. Take every symbol in the output sequence and check if it aligns with a symbol in the input sequence (based on the alignments produced by the algorithm in Figure 1);
2. If the output symbol does not align with any character, instruct the decoder to generate it (the case of the arbitrary character in the vocabulary);
3. If the output symbol aligns, instruct the decoder to generate INC symbols until the focus-index reaches the corresponding input character, and then generate a COPY symbol;
4. When the sequence is completely generated, instruct the decoder to generate an EOS symbol.
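The four oracle steps can be sketched as below. The constants and the function name `fixed_oracle` are hypothetical, and the alignments are assumed to be monotonic (i, j) pairs of source and destination indexes, as produced by the alignment step.

```python
# Hypothetical string constants standing in for the special symbols.
COPY, INC, EOS = "<COPY>", "<INC>", "<EOS>"


def fixed_oracle(dst, alignments):
    """Turn a target string and its alignments into decoder labels.

    Assumes the (source_index, destination_index) pairs are monotonic,
    so the focus-index never has to move backwards.
    """
    dst_to_src = {j: i for i, j in alignments}
    focus = 0
    labels = []
    for j, ch in enumerate(dst):
        if j in dst_to_src:
            target = dst_to_src[j]
            # Step 3: advance the focus-index, then copy.
            labels.extend([INC] * (target - focus))
            focus = target
            labels.append(COPY)
        else:
            # Step 2: unaligned output symbols are generated directly.
            labels.append(ch)
    # Step 4: close the sequence.
    labels.append(EOS)
    return labels
```

Note that in this sketch COPY does not advance the focus-index; only INC does, which is consistent with the label sequences used in the worked example of this section.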
Because English reinflection is fairly simple, we present a step-by-step example using an entry from the Romanian dataset.
Assume the lemma is "face" (en. "to do") and it has to be reinflected for the morphological description V;IND;PST;3;PL;IPFV. The decoder has to generate the word form "făceau" (en. "they were doing"). This means that the inflected form is obtained by replacing the character 'a' with the character 'ă' and by adding the suffix "au". Figure 2 shows the alignments obtained via dynamic programming between the characters of the lemma (top) and the characters of the word form (bottom). The dashed lines correspond to alignments where the characters in the source and destination are not identical. The final alignment pairs (the solid lines) are: (0,0), (2,2) and (3,3).
Based on these alignments, the FST symbols generated by the fixed-oracle algorithm are: COPY, 'Ă', INC, INC, COPY, INC, COPY, 'A', 'U', EOS. Notice that after copying the first symbol ('F') to the output, the oracle immediately generates the vocabulary item 'Ă', because it is not aligned with any symbol in the source lemma. However, the next (3rd) symbol in the destination string is aligned with a character in the source string, so the index is incremented with two INC commands before the COPY. The rest of the sequence is generated in a similar fashion.
Note 1: Fixed-oracle training is known to produce suboptimal results when compared with dynamic-oracle training. However, we did not have time to experiment with the latter training method and leave this for future work.

Training details and experimental results
Our implementation is based on DyNet (Neubig et al., 2017), a neural network framework built on dynamic computation graphs. This means that we do not require any padding when we prepare mini-batches. We evaluated our approach on the data provided for the SIGMORPHON 2018 Shared Task on morphological reinflection (Cotterell et al., 2018). During the evaluation campaign, each language was provided with 3 datasets of different sizes (high, medium and low). Because neural approaches traditionally require more training data to generalize well, we only built models for the "high" datasets, which were composed of 10K training examples for each language.
Our model was trained using Adam optimization (Kingma and Ba, 2014), with the default parameters α = 1e-3, β_1 = 0.9 and β_2 = 0.999. We used a mini-batch size of 1K words and trained each model until the accuracy on the development set stopped improving for 20 iterations. At the end, we used the best-performing model for each language. For all languages we used a two-layer encoder with 200 LSTM cells in each direction (400 cells per layer in total) and a two-layer decoder of 200 unidirectional cells. Each character in the vocabulary is embedded as a 100-dimensional vector. We also use a 100-dimensional embedding for each unique morphological descriptor.
Table 1 summarizes the test set results for all languages in the SIGMORPHON Challenge 2018. During the official evaluation campaign, our system was affected by a bug which caused all weights belonging to non-recurrent cells to be constant (not trainable during backpropagation). This issue had a strong negative impact on the results. After fixing it, we retrained our models and we include the unofficial results in the same table, under the "Acc.*" column. For almost all languages, the accuracy strongly increased after correcting the bug; for Welsh we observed no increase, and only for 2 languages did we observe a decrease of less than 1 point (probably due to weight initialization compounded by small models where the LSTMs overcame the fixed random weights of the dense layers). Overall, we observed a strong increase in results, from an average of 72.49 to 83.77. For example, for West Frisian, where initially the model would not converge (0.00), we now obtain 93.00; similarly, for Armenian, we have gone from 0.00 to 93.9.

Conclusions
[...] simpler models. This is mainly because (a) our model introduces the COPY operation, which reduces the representational load of the encoder-decoder model, and (b) we keep track of the focus-index externally.
We also reduce the computational complexity of the model by completely removing the calculations involved in the soft attention mechanism (n * m matrix multiplications, where n is the size of the input sequence and m the size of the output sequence).
Moreover, the fact that the decoder does not need to take the previous output and embed it as input for the next step demonstrates that there is far less representational overhead involved in generating the output sequence.
As a side note, in our previous experiments with lemmatization, we observed that using this model yields a 2-5% absolute increase in accuracy over the standard soft-attention sequence-to-sequence model.