Hard Non-Monotonic Attention for Character-Level Transduction

Character-level string-to-string transduction is an important component of various NLP tasks. The goal is to map an input string to an output string, where the strings may be of different lengths and have characters taken from different alphabets. Recent approaches have used sequence-to-sequence models with an attention mechanism to learn which parts of the input string the model should focus on during the generation of the output string. Both soft attention and hard monotonic attention have been used, but hard non-monotonic attention has only been used in other sequence modeling tasks and has required a stochastic approximation to compute the gradient. In this work, we introduce an exact, polynomial-time algorithm for marginalizing over the exponential number of non-monotonic alignments between two strings, showing that hard attention models can be viewed as neural reparameterizations of the classical IBM Model 1. We compare soft and hard non-monotonic attention experimentally and find that the exact algorithm significantly improves performance over the stochastic approximation and outperforms soft attention.


Introduction
Many natural language tasks are expressible as string-to-string transductions operating at the character level. Probability models with recurrent neural parameterizations currently hold the state of the art on many such tasks. On those string-to-string transduction tasks that involve a mapping between two strings of different lengths, it is often necessary to resolve which input symbols are related to which output symbols. As an example, consider the task of transliterating a Russian word into the Latin alphabet. In many cases, there exists a one-to-two mapping between Cyrillic and Latin letters: in Хрущёв (Khrushchev), the Russian Х can be considered to generate the Latin letters Kh. Supervision is rarely, if ever, provided at the level of character-to-character alignments; this is the problem that attention seeks to solve in neural models.
With the rise of recurrent neural networks, this problem has been handled with "soft" attention rather than traditional hard alignment. Attention (Bahdanau et al., 2015) is often described as "soft," as it does not clearly associate a single input symbol with each output symbol, but rather offers a fuzzy notion of which input symbols may be responsible for which symbols in the output. In contrast, an alignment directly associates a given input symbol with a given output symbol. To express uncertainty, practitioners often place a distribution over the exponential number of hard non-monotonic alignments, just as a probabilistic parser places a distribution over an exponential number of trees. The goal, then, is to learn the parameters of this distribution over all non-monotonic alignments through backpropagation. Incorporating hard alignment into probabilistic transduction models dates back much further in the NLP literature, arguably originating with the seminal paper by Brown et al. (1993). Some neural approaches have moved back towards this more rigid notion of alignment, referring to it as "hard attention." We will refer to the neural variant as "hard attention" and to the more classical approaches as "alignment."

This paper offers two insights into the usage of hard alignment. First, we derive a dynamic program for the exact computation of the likelihood in a neural model with latent hard alignment; previous work has used a stochastic algorithm to approximately sum over the exponential number of alignments between strings. In so doing, we relate neural hard alignment models to the classical IBM Model 1 for alignment in machine translation. Second, we provide an experimental comparison that indicates hard attention models outperform soft attention models on three character-level string-to-string transduction tasks: grapheme-to-phoneme conversion, named-entity transliteration, and morphological inflection.

Non-Monotonic Transduction
This paper presents a novel, neural, probabilistic latent-variable model for non-monotonic transduction. As a concrete example of a non-monotonic transduction, consider the mapping of a Pingelapese infinitive to its gerund, as shown in Fig. 1. The mapping requires us to generate the output string left-to-right, bouncing around the input string out-of-order to determine the characters to transduce from. As the non-monotonic alignment is the latent variable, we will face a combinatorial problem: summing over all non-monotonic alignments. The algorithmic contribution of this paper is the derivation of a simple dynamic program for computing this sum in polynomial time that still allows for very rich recurrent neural featurization of the model. With respect to the literature, our paper represents the first instance of exact marginalization for a neural transducer with hard non-monotonic alignment; previous methods, such as Rastogi et al. (2016) and Aharoni and Goldberg (2017), are exclusively monotonic.

Figure 1: Example of a non-monotonic character-level transduction from the Micronesian language of Pingelapese. The infinitive mejr is mapped through a reduplicative process to its gerund mejmejr (Rehg and Sohl, 1981). Each input character is drawn in green and each output character is drawn in purple, connected with a line to the corresponding input character.
Non-monotonic methods dominate character-level transduction. Indeed, the state of the art in classic character-level NLP tasks such as grapheme-to-phoneme conversion (Yao and Zweig, 2015), transliteration (Rosca and Breuel, 2016) and morphological inflection generation (Kann and Schütze, 2016) is held by the soft non-monotonic method of Bahdanau et al. (2015). Even though non-monotonicity is more common in word-level tasks, it also exists in character-level transduction tasks, as evidenced by our example in Fig. 1 and by the superior performance of non-monotonic methods. Our error analysis in §8.4 sheds some light on why non-monotonic methods are the state of the art in a seemingly monotonic task.
A Note on the Character-level Focus. A natural question at this point is why we do not experiment with word-level transduction tasks, such as machine translation. As we show in §3.2, our method is often an order of magnitude slower than standard soft attention. Thus, the exact marginalization scheme is practically unworkable for machine translation; we discuss future extensions to machine translation in §6. However, the slow-down is no problem for character-level tasks, and we show empirical gains in §8.

The Latent-Variable Model
An alphabet is a finite, non-empty set. Given two alphabets Σ_x = {x_1, …, x_{|Σ_x|}} and Σ_y = {y_1, …, y_{|Σ_y|}}, probabilistic approaches to the problem attempt to estimate a probability distribution p(y | x), where y ∈ Σ_y^* and x ∈ Σ_x^*. Foreshadowing, we will define the parameters of p to be, in part, the parameters of a recurrent neural network, in line with the state-of-the-art models. We define the set A = {1, …, |x|}^{|y|}, which has an interpretation as the set of all (potentially non-monotonic) alignments from x to y with the implicit constraint that each output symbol y_i aligns to exactly one symbol in x ∈ Σ_x^*. In other words, A is the set of all many-to-one alignments between x and y, where "many" may be as few as zero. We remark that |A| = |x|^{|y|}, which is exponentially large in the length of the target string y. For an a ∈ A, A_i = a_i refers to the event that y_i, the i-th component of y, is aligned to x_{a_i}, the a_i-th component of x. We define a probability distribution over output strings y conditioned on an input string x, where we marginalize out the unobserved alignments a:

$$
\begin{aligned}
p(y \mid x) &= \sum_{a \in A} p(y, a \mid x) && \text{(1a)} \\
&= \sum_{a \in A} \prod_{i=1}^{|y|} p(y_i \mid a_i, y_{<i}, x)\, p(a_i \mid y_{<i}, x) && \text{(1b)} \\
&= \prod_{i=1}^{|y|} \sum_{j=1}^{|x|} \alpha_j(i)\, p(y_i \mid j, y_{<i}, x) && \text{(1c)}
\end{aligned}
$$

where the transition from (1a) to (1b) follows from the independence assumption, and we define α_j(i) = p(a_i = j | y_<i, x), substituting j = a_i in order to better compare our model notationally to that of Bahdanau et al. (2015) in §5. Each distribution p(y_i | j, y_<i, x) in the definition of the model has a clean interpretation as a distribution over the output vocabulary Σ_y, given an input string x ∈ Σ_x^*, where y_i is aligned to x_j. Thus, one way of thinking about this hard alignment model is as a product of mixture models, one mixture at each step, with mixing coefficients α_j(i).
A Note on EOS. We suppressed EOS in the autoregressive models above for brevity; e.g., p(y_i | j, y_<i, x) is a distribution over Σ_y ∪ {EOS}, in order for p(y | x) to be a probability distribution.
Why Does Dynamic Programming Work? Our dynamic program for computing the likelihood, fully specified in eq. (1c), is quite simple: the non-monotonic alignments are independent of each other, i.e., α_j(i) is independent of α_j(i−1), conditioned on the observed sequence y. This means that we can cleverly rearrange the terms in eq. (1b) using the distributive property. Were this not the case, we could not do better than an exponential number of summands. This is immediately clear when we view our model as a graphical model, as in Fig. 2: there is no active trail from a_i to a_k for k > i, ignoring the dashed lines. Note that this is no different from the tricks used to achieve exact inference in n-th-order Markov models: one makes an independence assumption between the current bit of structure and the previous bits of structure to allow an efficient algorithm. For a proof of the equality between eq. (1b) and eq. (1c), one may look to Brown et al. (1993). Foreshadowing, we note that certain parameterizations make use of input feeding (Luong et al., 2015, §3.3), which breaks this independence; see §5.1.
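To make the dynamic program concrete, the following is a minimal sketch of eq. (1c) in log space, written in PyTorch. The tensor names and shapes (log_alpha, log_emit) are our own illustrative conventions, not the authors' code; the sketch assumes the per-step alignment and emission distributions have already been computed.

```python
import torch

def exact_marginal_log_likelihood(log_alpha, log_emit, y):
    """Exact log p(y | x) via eq. (1c), computed in log space.

    log_alpha: (T_y, T_x) tensor of log alpha_j(i), the log alignment
               distribution at each output step i.
    log_emit:  (T_y, T_x, V) tensor of log p(y_i | j, y_<i, x) for every
               candidate alignment position j.
    y:         (T_y,) LongTensor of gold output symbol ids.
    """
    T_y, T_x, _ = log_emit.shape
    # Select log p(y_i | j, y_<i, x) for the observed y_i at every (i, j) pair.
    gold = log_emit.gather(2, y.view(T_y, 1, 1).expand(T_y, T_x, 1)).squeeze(2)
    # Sum over alignments j within each step (logsumexp), then over steps i.
    per_step = torch.logsumexp(log_alpha + gold, dim=1)  # shape (T_y,)
    return per_step.sum()                                # log p(y | x)
```

Because the sum over alignments factorizes per output step, the whole marginal costs on the order of |y| · |x| additions on top of the |x| output softmaxes per step, rather than the |x|^{|y|} terms of the naive sum.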
Relation to IBM Model 1. The derivation above is similar to that of IBM Model 1. We remark, however, on two key generalizations that will serve our recurrent neural parameterization well in §4. First, traditional derivations of IBM Model 1 omit a prior over alignments p(a_i | x), taking it to be uniform. Due to this omission, an additional multiplicative constant ε / |x|^{|y|} is introduced to ensure the distribution remains normalized (Koehn, 2009). Second, IBM Model 1 does not condition on previously generated words on the output side. In other words, in their original model, Brown et al. (1993) assume that p(y_i | a_i, y_<i, x) = p(y_i | a_i, x), forsaking dependence on y_<i. We note that there is no reason why we need to make this independence assumption: we will likely want a target-side language model in transduction. Indeed, subsequent statistical machine translation systems, e.g., MOSES (Koehn et al., 2007), integrate a language model into the decoder. It is of note that many models in NLP have made similar independence assumptions, e.g., the emission distributions of hidden Markov models (HMMs) are typically taken to be independent of all previous emissions (Rabiner, 1989). These assumptions are generally not necessary. Exact marginalization over hard alignments was, of course, already present in the framework of Brown et al. (1993); it has simply been forgotten in recent formulations of hard attention (Xu et al., 2015), which use stochastic approximation to handle the exponential number of summands. As we will see in §5, the soft-attention model of Bahdanau et al. (2015) can be computed with nearly the same network, differing only in where the alignment distribution enters; we compare the runtime of the two approaches in §3.2.

Algorithmic Analysis
When Σ_y is large, for example in machine translation where |Σ_y| is at least in the tens of thousands, the |x| · |y| term in the soft-attention model's runtime can be ignored, and exact marginalization incurs an extra factor of |x| relative to the soft-attention model, since it computes one output softmax per source position. In practice, Shi and Knight (2017) show that the bottleneck of an NMT system is the softmax layer, making the extra |x| factor practically cumbersome.

Recurrent Neural Parameterization
How do we parameterize p(y_i | a_i, y_<i, x) and α_j(i) in our hard, non-monotonic transduction model? We will use a neural network identical to the one proposed in the attention-based sequence-to-sequence model of Luong et al. (2015) without input feeding (a variant of Bahdanau et al. (2015)).

Encoding the Input
All models discussed in this exposition will make use of the same mechanism for mapping a source string x ∈ Σ_x^* into a sequence of fixed-length representations. This mapping takes the form of a bidirectional recurrent neural network encoder, which works as follows: each element of Σ_x is mapped to an embedding vector of length d_e through a mapping e : Σ_x → R^{d_e}. The RNN then folds the following recursion over the string x left-to-right:

$$
\overrightarrow{h}^{(enc)}_j = \tanh\!\left( W^{(enc)}\, e(x_j) + U^{(enc)}\, \overrightarrow{h}^{(enc)}_{j-1} + b^{(enc)} \right)
$$

where we fix the 0-th hidden state h^{(enc)}_0 to the zero vector and the matrices W^{(enc)} ∈ R^{d_h × d_e} and U^{(enc)} ∈ R^{d_h × d_h} (along with the bias b^{(enc)} ∈ R^{d_h}) are parameters to be learned. Performing the same procedure on the reversed string with an RNN that has different parameters, we arrive at hidden state vectors ←h^{(enc)}_j. The final hidden states from the encoder are the concatenation of the two, i.e., h^{(enc)}_j = →h^{(enc)}_j ⊕ ←h^{(enc)}_j, where ⊕ is vector concatenation. As has become standard, we use an extension of this recursion: we apply the long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) recursions, rather than those of a vanilla RNN (Elman network; Elman, 1990).

Figure 2: Our hard-attention model without input feeding viewed as a graphical model. The circular nodes are random variables and the diamond nodes are deterministic variables (h^{(dec)}_i is first discussed in §4.3). The independence assumption between the alignments a_i when the y_i are observed becomes clear. We have omitted arcs from x to y_1, y_2, y_3, and y_4 for clarity (to avoid crossing arcs). The dashed edges show the additional dependencies added in the input-feeding version, as discussed in §5.1; once we add these in, the a_i are no longer independent, which breaks exact marginalization. Note that the hard-attention model does not enforce an exact one-to-one constraint: each source-side word is free to align with many of the target-side words, independent of context. In the latent-variable model, the x variable is a vector of source words, and the alignment may be over more than one element of x.
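For concreteness, a minimal sketch of such an encoder in PyTorch is below, assuming character ids as input; the class and hyperparameter names are ours, not part of the released code.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Maps source character ids to encoder states h_j^(enc) of size 2*d_h."""
    def __init__(self, vocab_size, d_e=128, d_h=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_e)  # e : Sigma_x -> R^{d_e}
        self.rnn = nn.LSTM(d_e, d_h, bidirectional=True, batch_first=True)

    def forward(self, x):
        # x: (batch, T_x) LongTensor of source character ids
        emb = self.embed(x)        # (batch, T_x, d_e)
        h_enc, _ = self.rnn(emb)   # (batch, T_x, 2*d_h): forward and backward states concatenated
        return h_enc
```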

Parameterization.
Now, we define the alignment distribution

$$
\alpha_j(i) = \mathrm{softmax}_j\!\left( h^{(enc)\top}_j\, T\, h^{(dec)}_i \right)
$$

where h^{(dec)}_i, the decoder RNN's hidden state, is defined in §4.3, and T is a learned matrix. Importantly, the alignment distribution α_j(i) at time step i depends only on the prefix of the output string y_<i generated so far; this is clear since the output-side decoder is a unidirectional RNN. We also define

$$
p(y_i \mid j, y_{<i}, x) = \mathrm{softmax}_{y_i}\!\left( W\, f\!\left( h^{(enc)}_j, h^{(dec)}_i \right) \right)
$$

where W is a learned projection onto the output vocabulary. The function f is non-linear and vector-valued; one popular choice of f is a multilayer perceptron with parameters to be learned. We define

$$
f\!\left( h^{(enc)}_j, h^{(dec)}_i \right) = \tanh\!\left( S \left( h^{(enc)}_j \oplus h^{(dec)}_i \right) \right) \qquad (5)
$$

where S ∈ R^{d_s × 3d_h}.
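A sketch of one decoding step under this parameterization follows, using a bilinear (Luong-style "general") score for the alignment distribution; the exact score function is our reading of the Luong et al. (2015) setup the paper adopts, and the class and layer names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HardAttentionStep(nn.Module):
    """One step: log alpha_j(i) and log p(y_i | j, y_<i, x) for every source position j."""
    def __init__(self, d_h, d_s, vocab_out):
        super().__init__()
        self.T = nn.Linear(d_h, 2 * d_h, bias=False)  # bilinear alignment score (assumed form)
        self.S = nn.Linear(3 * d_h, d_s)              # S in eq. (5)
        self.W = nn.Linear(d_s, vocab_out)            # projection onto Sigma_y plus EOS

    def forward(self, h_enc, h_dec):
        # h_enc: (batch, T_x, 2*d_h); h_dec: (batch, d_h)
        scores = torch.einsum('btd,bd->bt', h_enc, self.T(h_dec))
        log_alpha = F.log_softmax(scores, dim=-1)                         # alignment distribution
        h_dec_tiled = h_dec.unsqueeze(1).expand(-1, h_enc.size(1), -1)
        f = torch.tanh(self.S(torch.cat([h_enc, h_dec_tiled], dim=-1)))   # eq. (5)
        log_emit = F.log_softmax(self.W(f), dim=-1)                       # one softmax per position j
        return log_alpha, log_emit
```

Feeding the per-step outputs of this module into the exact marginalization sketch from §3 recovers the training objective.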

Updating the Hidden State h^{(dec)}_i
The hidden state h^{(dec)}_i is also updated through the LSTM recurrences (Hochreiter and Schmidhuber, 1997). The RNN version of the recurrence mirrors that of the encoder:

$$
h^{(dec)}_i = \tanh\!\left( W^{(dec)}\, e^{(dec)}(y_{i-1}) + U^{(dec)}\, h^{(dec)}_{i-1} + b^{(dec)} \right) \qquad (6)
$$

where e^{(dec)} : Σ_y → R^{d_e} produces an embedding of each of the symbols in the output alphabet. What is crucial about this RNN, like the α_j(i), is that it summarizes only the characters decoded so far, independently of the previous attention weights. In other words, the attention weights at time step i receive no influence from the attention weights at previous time steps, as shown in Fig. 2. This is what allows for dynamic programming.

Transduction with Soft Attention
In order to contrast it with the hard alignment mechanism we develop, we introduce here Luong attention (Luong et al., 2015) for recurrent neural sequence-to-sequence models (Sutskever et al., 2014).
Note that this model will also serve as an experimental baseline in §8.
The soft-attention transduction model defines a distribution over the output strings Σ_y^*, much like the hard-attention model, with the following expression:

$$
p(y \mid x) = \prod_{i=1}^{|y|} p(y_i \mid y_{<i}, x)
$$

where we define each conditional distribution, again a distribution over Σ_y ∪ {EOS}, as

$$
p(y_i \mid y_{<i}, x) = \mathrm{softmax}_{y_i}\!\left( W\, f\!\left( c_i, h^{(dec)}_i \right) \right) \qquad (8)
$$

We reuse the function f from eq. (5). The hidden state h^{(dec)}_i, as before, is the i-th state of a target-side language model that summarizes the prefix of the string decoded so far; this is explained in §4.3. And, finally, we define the context vector

$$
c_i = \sum_{j=1}^{|x|} \alpha_j(i)\, h^{(enc)}_j
$$

using the same alignment distribution as in §4.2. In the context of the soft-attention model, this distribution is referred to as the attention weights.
Inspection shows that there is only a small difference between the soft-attention model presented here and our hard non-monotonic attention model. The difference is where we place the probabilities α_j(i). In the soft-attention version, we place them inside the softmax (and the function f), as in eq. (8), and we feed a mixture of the encoder's hidden states, the context vector, into the model. On the other hand, if we place them outside the softmax, we have a mixture of softmaxes, as shown in eq. (1c). Both models have an identical set of parameters.
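The contrast can be made concrete with the soft-attention counterpart of the step above, reusing the same (assumed) layers T, S and W from our sketch: α_j(i) is now mixed into a context vector before a single softmax.

```python
import torch
import torch.nn.functional as F

def soft_attention_step(T, S, W, h_enc, h_dec):
    """Soft-attention analogue of HardAttentionStep.forward, sharing the layers T, S, W."""
    scores = torch.einsum('btd,bd->bt', h_enc, T(h_dec))
    alpha = F.softmax(scores, dim=-1)                  # attention weights alpha_j(i)
    c = torch.einsum('bt,btd->bd', alpha, h_enc)       # context vector c_i
    f = torch.tanh(S(torch.cat([c, h_dec], dim=-1)))   # alpha now sits inside f
    return F.log_softmax(W(f), dim=-1)                 # a single softmax: eq. (8)
```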

Input Feeding: What's That?
The recurrence in eq. (6), however, is not the only approach. Input feeding is another popular approach that is, perhaps, standard at this point (Luong et al., 2015). Input feeding refers to the setting where the architecture designer additionally feeds the attention-derived context into the update for the decoder's hidden state. This yields the recursion

$$
h^{(dec)}_i = \mathrm{LSTM}\!\left( e^{(dec)}(y_{i-1}) \oplus c_{i-1},\; h^{(dec)}_{i-1} \right) \qquad (10)
$$

whose input now lives in R^{d_e + d_s}. This is the architecture discussed in Bahdanau et al. (2015, §3.1). In contrast to the architecture above, this architecture has attention weights that do depend on previous attention weights due to the feeding in of the context vector c_i. See Cohn et al. (2016) for another attempt to incorporate structural biases directly into the attention mechanism such that the attention distribution is influenced by previous attention distributions.
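A sketch of the input-feeding decoder update is below, assuming the fed-back vector has dimension d_s as in eq. (10); the module name and the choice of LSTMCell are ours.

```python
import torch
import torch.nn as nn

class InputFeedingDecoderCell(nn.Module):
    """Decoder update with input feeding: the previous context vector is
    concatenated with the embedding of the previously generated symbol."""
    def __init__(self, vocab_out, d_e, d_s, d_h):
        super().__init__()
        self.embed = nn.Embedding(vocab_out, d_e)
        self.cell = nn.LSTMCell(d_e + d_s, d_h)   # input lives in R^{d_e + d_s}, as in eq. (10)

    def forward(self, y_prev, c_prev, state):
        # y_prev: (batch,) previous output ids; c_prev: (batch, d_s); state: (h, c) LSTM state
        inp = torch.cat([self.embed(y_prev), c_prev], dim=-1)
        return self.cell(inp, state)               # new (h_dec_i, cell_i)
```

Because c_prev depends on the previous attention distribution, the alignments are no longer conditionally independent, and the dynamic program of §3 no longer applies.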

Combining Hard Non-Monotonic Attention with Input Feeding
To combine hard attention with input feeding, Xu et al. (2015) derive a variational lower bound on the log-likelihood through Jensen's inequality:

$$
\log p(y \mid x) = \log \sum_{a \in A} p(a \mid x)\, p(y \mid a, x) \;\geq\; \sum_{a \in A} p(a \mid x)\, \log p(y \mid a, x)
$$

Note that we have omitted the dependence of p(a | x) on the appropriate prefix of y; this was done for notational simplicity. Using this bound, Xu et al. (2015) derive an efficient approximation to the gradient using the REINFORCE trick of Williams (1992). This sampling-based gradient estimator is then used for learning, but suffers from high variance. We compare our dynamic-programming version to this model in §8.
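A single-sample REINFORCE surrogate for this bound, with the moving-average baseline used later in §8.2, might look like the following; it is a sketch under our own naming, not the estimator code of Xu et al. (2015).

```python
import torch

def reinforce_surrogate(log_alpha, log_emit, y, baseline):
    """One-sample REINFORCE surrogate loss for the Jensen lower bound.

    log_alpha: (T_y, T_x) log alignment distribution; log_emit: (T_y, T_x, V)
    log p(y_i | j, y_<i, x); y: (T_y,) gold ids; baseline: scalar moving average.
    """
    a = torch.distributions.Categorical(logits=log_alpha).sample()  # one alignment per step
    steps = torch.arange(y.size(0))
    log_p_y_given_a = log_emit[steps, a, y].sum()                   # reward: log p(y | a, x)
    reward = log_p_y_given_a.detach()
    log_p_a = log_alpha[steps, a].sum()                             # score-function term
    # Gradient flows through the emission term directly, and through the
    # alignment distribution via the centred-reward-weighted score function.
    loss = -(log_p_y_given_a + (reward - baseline) * log_p_a)
    return loss, reward
```

One common choice for the baseline update is baseline = 0.9 * baseline + 0.1 * reward, consistent with the discount factor of 0.9 reported in §8.2.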
Future Work

Just as Brown et al. (1993) started with IBM Model 1 and built up to richer models, we can do the same. Extensions that generalize IBM Model 1, resembling those of IBM Model 2 and the HMM aligner (Vogel et al., 1996), are easily bolted onto our proposed model as well. If we are willing to perform approximate inference, we may also consider fertility, as found in IBM Model 4.
In order to extend our method to machine translation (MT) in any practical manner, we require an approximation to the softmax. Given that the softmax is already the bottleneck of neural MT models (Shi and Knight, 2017), we cannot afford ourselves an O(|x|) slowdown during training. Many methods have been proposed for approximating the softmax (Goodman, 2001; Bengio et al., 2003; Gutmann and Hyvärinen, 2010). More recently, Chen et al. (2016) compared such methods on neural language modeling, and Grave et al. (2017) proposed a GPU-friendly method.

The Tasks
The empirical portion of the paper focuses on character-level string-to-string transduction problems. We consider three tasks: G: grapheme-to-phoneme conversion, T: named-entity transliteration, and I: morphological inflection. We describe each briefly in turn and give an example of a source and target string for each task in Tab. 1.
Grapheme-to-Phoneme Conversion. We use the standard grapheme-to-phoneme conversion (G2P) datasets: the Sphinx-compatible version of CMUDict (Weide, 1998) and NetTalk (Sejnowski and Rosenberg, 1987). G2P transduces a word, a string of graphemes, into its pronunciation, a string of phonemes. We evaluate with word error rate (WER) and phoneme error rate (PER) (Yao and Zweig, 2015). PER is the edit distance divided by the length of the gold string of phonemes.
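As a small illustration of the metric, PER is just a Levenshtein distance normalized by the reference length; a minimal sketch:

```python
def phoneme_error_rate(pred, gold):
    """Levenshtein distance between phoneme sequences, divided by len(gold)."""
    # d[i][j] = edit distance between pred[:i] and gold[:j]
    d = [[i + j if i == 0 or j == 0 else 0 for j in range(len(gold) + 1)]
         for i in range(len(pred) + 1)]
    for i in range(1, len(pred) + 1):
        for j in range(1, len(gold) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                                  # deletion
                          d[i][j - 1] + 1,                                  # insertion
                          d[i - 1][j - 1] + (pred[i - 1] != gold[j - 1]))   # substitution
    return d[-1][-1] / len(gold)

# e.g. phoneme_error_rate(list("akt"), list("akst")) == 0.25
```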
Named-Entity Transliteration. We use the NEWS 2015 shared task on machine transliteration (Zhang et al., 2015) as our named-entity transliteration dataset. It contains 14 language pairs. Transliteration transduces a named entity from its source language to a target language, in other words, from a string in the source orthography to a string in the target orthography. We evaluate with word accuracy in percent (ACC) and mean F-score (MFS) (Zhang et al., 2015). For completeness, we include the definition of MFS in App. A.
Morphological Inflection. We consider the high-resource setting of Task 1 of the CoNLL-SIGMORPHON 2017 shared task (Cotterell et al., 2017) as our morphological inflection dataset. It contains 51 languages in the high-resource setting. Morphological inflection transduces a lemma (a string of characters) and a morphological tag (a sequence of subtags) into an inflected form of the word (a string of characters). We evaluate with word accuracy (ACC) and average edit distance (MLD) (Cotterell et al., 2017).

Experiments
The goal of the empirical portion of our paper is to perform a controlled study of the different architectures and approximations discussed up to this point. §8.1 exhibits the neural architectures we compare, and the main experimental results are in Tab. 3. In §8.2, we present the experimental minutiae, e.g., hyperparameters. In §8.3, we analyze our experimental findings. Finally, in §8.4, we perform an error analysis and visualize the soft attention weights and the hard alignment distribution.

The Architectures
The four architectures we consider in the controlled comparison are: 1: soft attention with input feeding, 2: hard attention with input feeding, 3: soft attention without input feeding, and 4: hard attention without input feeding (our system). They are also shown in Tab. 2. As a fifth system, we compare to the monotonic system M of Aharoni and Goldberg (2017). Additionally, we present U, a variant of 1 where the number of parameters is not controlled for, and R, a variant of 4 trained using REINFORCE instead of exact marginalization.

Experimental Details
We implement the experiments with PyTorch (Paszke et al., 2017) and we port the code of Aharoni and Goldberg (2017) to admit batched training. Because we did not observe any improvements in preliminary experiments when decoding with beam search (compared to greedy decoding, with an average error rate of 20.1% and an average edit distance of 0.385, beam search with beam size 5 achieves a slightly better edit distance of 0.381 while slightly hurting the error rate at 20.2%), all models are decoded greedily.
Data Preparation. For G, we sample 5% and 10% of the data as the development set and test set, respectively. For T, we only run experiments on 11 of the 14 language pairs because we do not have access to all of the data.
Model Hyperparameters. The hyperparameters of all models are given in Tab. 4. The hyperparameters of the large model are tuned using the baseline 3 on selected languages in I, and the search range is also shown in Tab. 4. All three tasks use the same two sets of hyperparameters. To ensure that 1 has the same number of parameters as the other models, we decrease d_s in eq. (5), while for the rest of the models d_s = 3d_h. Additionally, we use a linear mapping to merge e^{(dec)}(y_{i-1}) and c_{i-1} in eq. (10) instead of concatenation. The output of the linear mapping has the same dimension as e^{(dec)}(y_{i-1}), ensuring that the RNN has the same size.
M has quite a different architecture: the input of its decoder RNN is the concatenation of the previously predicted word embedding, the encoder's hidden state at a specific step, and, in the case of I, the encoding of the morphological tag. Differing from Aharoni and Goldberg (2017), we concatenate all attributes' embeddings (zero vectors for attributes that are not applicable) and merge them with a linear mapping. The dimensions of the merged vector and the attribute embeddings are d_e. To ensure that M has the same number of parameters as the rest of the models, we increase the hidden size of its decoder RNN.
Optimization. We train the models with Adam (Kingma and Ba, 2015) with an initial learning rate of 0.001. We halve the learning rate whenever the development log-likelihood does not improve, and stop once the learning rate dips below 1 × 10^{-5}. We save all models after each epoch and select the model with the best development performance. We train each model for at most 50 epochs, though all of the experiments stop early. We train on G, T, and I with batch sizes of 20, 50 and 20, respectively. We notice in the experiments that the training of 1 and U is quite unstable with the large model, probably because of the longer chain of gradient information flow. We apply gradient clipping to the large model with a maximum gradient norm of 5.
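For reference, the schedule described above can be set up roughly as follows in PyTorch; the helper name is ours and the exact scheduler is an assumption (any mechanism that halves the rate on a stalled development score would do).

```python
import torch

def make_optimizer(model):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Halve the learning rate whenever the dev log-likelihood stops improving.
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode='max',
                                                       factor=0.5, patience=0)
    return opt, sched

# Inside the training loop (sketch):
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # large model only
#   opt.step(); opt.zero_grad()
#   ... at the end of each epoch ...
#   sched.step(dev_log_likelihood)
#   if opt.param_groups[0]['lr'] < 1e-5:
#       break   # stop once the learning rate dips below 1e-5
```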

REINFORCE.
In the REINFORCE training of R and 2, we sample 2 and 4 alignment positions at each time step for the small and large models, respectively. The latter is tuned on selected languages in I with search range {2, 3, 4, 5}. To stabilize the training, we apply a baseline with a moving-average reward and a discount factor of 0.9, similar to Xu et al. (2015).

Experimental Findings
Finding #1: Effect of Input Feeding. By comparing 3 and 4 against 1 and 2 in Tab. 3, we find that input feeding hurts performance in all settings and all tasks. This runs contrary to the reported results of Luong et al. (2015), but they experiment on machine translation rather than character-level transduction. This validates our independence assumption about the alignment distribution.
Finding #2: Hard vs. Soft Attention. Training with REINFORCE hurts the performance of the hard attention model; compare 1 and 2 (trained with REINFORCE) in Tab. 3. On the other hand, training with exact marginalization causes the hard attention model to outperform the soft attention model in nearly all settings; compare 3 and 4 in Tab. 3. This comparison shows that hard attention outperforms soft attention in character-level string transduction when trained with exact marginalization.
Finding #3: Effect of Monotonicity. The monotonic model M underperforms the non-monotonic model 3 in Tab. 3 in all but one setting. It performs slightly worse on T and G due to the many-to-one alignments in the data and the fact that Aharoni and Goldberg (2017) can only use the hidden vector of the final element of the span in a many-to-one alignment to directly predict the single target element. The current state-of-the-art systems for character-level string transduction are non-monotonic models, despite the tasks' seeming monotonicity; see §8.4.
Finding #4: Approximate Hard Attention. Given our development of an exact marginalization method for neural models with hard attention, a natural question to ask is how much the exact marginalization helps during learning. By comparing 4 and R in Tab. 3, we observe that training with exact marginalization clearly outperforms training under the stochastic approximation in every setting and on every dataset. We also observe that exact marginalization allows faster convergence, since training with REINFORCE is quite unstable, with some runs seemingly getting stuck.

Table 5: Breakdown of correct and incorrect predictions of monotonic and non-monotonic alignments of 3 and 4 in G, derived from the soft attention weights and the hard alignment distribution.
Finding #5: Controlling for Parameters.Input feeding yields a more expressive model, but also leads to an increase in the number of parameters.
Here, we explore what effect this has on the performance of the models. In their ablation, Luong et al. (2015) did not control for the number of parameters when adding input feeding. The total number of parameters of U is 1.679M in the small setting and 10.541M in the large setting, which is 40% and 22.3% more parameters than the controlled setting, respectively.
By comparing 1 and U in Tab. 3, we find that the increase in parameters, rather than the increase in expressivity, explains the success of input feeding.

Visualization and Error Analysis
We hypothesize that even though the model is non-monotonic, it can learn a largely monotonic alignment while retaining flexibility where necessary, giving state-of-the-art results on many seemingly monotonic character-level string transduction tasks. To gain more insight, we compare the best soft attention model (3) against the best hard alignment model (4) on G by showing the confusion matrix of each model in Tab. 5. An alignment is non-monotonic when the alignment edges predicted by the model cross.
There is an edge connecting x_j and y_i if the attention weight or hard alignment probability α_j(i) is larger than 0.1. We find that the better-performing transducers are more monotonic, and most learned alignments are monotonic. The results indicate that there are a few transductions in the dataset that are indeed non-monotonic. However, their number is so small that this does not entirely explain why non-monotonic models outperform the monotonic models. We speculate the reason lies in the architecture of Aharoni and Goldberg (2017), which does not permit many-to-one alignments, whereas the monotonic alignment learned by a non-monotonic model is more flexible. Future work will investigate this. In Fig. 3, we visualize the soft attention weights (3) and the hard alignment distribution (4) side by side. We observe that the hard alignment distribution is more interpretable, with a clear boundary when predicting the prefixes.
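The monotonicity check used in this analysis can be stated compactly: threshold α_j(i) at 0.1 to obtain alignment edges, then flag the prediction as non-monotonic if any two edges cross. A sketch under our own naming:

```python
def is_monotonic(alpha, threshold=0.1):
    """alpha: (T_y, T_x) array of attention weights or hard alignment probabilities.
    An edge (i, j) exists when alpha[i][j] > threshold; the alignment is
    non-monotonic if two edges (i, j) and (i', j') with i < i' have j > j'."""
    edges = [(i, j) for i in range(len(alpha))
             for j in range(len(alpha[i])) if alpha[i][j] > threshold]
    for k, (i, j) in enumerate(edges):
        for (i2, j2) in edges[k + 1:]:
            if (i - i2) * (j - j2) < 0:   # the two edges cross
                return False
    return True
```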

Conclusion
We exhibit an efficient dynamic program for the exact marginalization of all non-monotonic alignments in a neural sequence-to-sequence model.We show empirically that the exact marginalization helps over approximate inference by REINFORCE and that models with hard, non-monotonic alignment outperform those with soft attention.

Table 1 :
Example of a source and target string for each task as processed by the model.

Table 2 :
The 4 architectures considered in the paper.

Table 3 :
Average test performance on G, T and I, averaged across datasets and languages. See App. B for the full breakdown.

Table 4 :
Model hyperparameters and search range

Table 8 :
Full breakdown of SIGMORPHON 2017 with the small model

Table 9 :
Full breakdown of SIGMORPHON 2017 with the large model