Exact Hard Monotonic Attention for Character-Level Transduction

Many common character-level, string-to-string transduction tasks, e.g., grapheme-to-phoneme conversion and morphological inflection, consist almost exclusively of monotonic transduction. Yet neural sequence-to-sequence models with soft attention, which are non-monotonic, outperform popular monotonic models. In this work, we ask the following question: Is monotonicity really a helpful inductive bias in these tasks? We develop a hard attention sequence-to-sequence model that enforces strict monotonicity and learns alignment jointly. With the help of dynamic programming, we are able to compute the exact marginalization over all alignments. Our models achieve state-of-the-art performance on morphological inflection. Furthermore, we find strong performance on two other character-level transduction tasks. Code is available at https://github.com/shijie-wu/neural-transducer.


Introduction
Many tasks in natural language processing can be treated as character-level, string-to-string transduction. The current dominant method is the neural sequence-to-sequence model with soft attention (Bahdanau et al., 2015; Luong et al., 2015). This method has achieved state-of-the-art results in a plethora of tasks, for example, grapheme-to-phoneme conversion (Yao and Zweig, 2015), named-entity transliteration (Rosca and Breuel, 2016) and morphological inflection generation (Cotterell et al., 2016). While soft attention is very similar to a traditional alignment between the source characters and target characters in some regards, it does not explicitly model a distribution over alignments. On the other hand, neural sequence-to-sequence models with hard attention are analogous to the classic IBM models for machine translation, which do model the alignment distribution explicitly (Brown et al., 1993).
The standard versions of soft and hard attention are non-monotonic. However, if we look at the data in grapheme-to-phoneme conversion, named-entity transliteration, and morphological inflection (examples are shown in Fig. 1), we see that the tasks require almost exclusively monotonic transduction. Yet, counterintuitively, the state of the art in high-resource morphological inflection is held by non-monotonic models (Cotterell et al., 2017)! Indeed, in a recent controlled experiment, Wu et al. (2018) found that non-monotonic models (with either soft or hard attention) outperform popular monotonic models (Aharoni and Goldberg, 2017) in the three above-mentioned tasks. However, the inductive bias of monotonicity, if correct, should help learn a better model or, at least, learn the same model.
In this paper, we hypothesize that the underperformance of monotonic models stems from the lack of joint training of the alignments with the transduction. Generalizing the model of Wu et al. (2018) to enforce monotonic alignments, we show that, for all three tasks considered, monotonicity is a good inductive bias and jointly learning a monotonic alignment improves performance. We provide an exact, cubic-time dynamic-programming inference algorithm to compute the log-likelihood and an approximate greedy decoding scheme. Empirically, our results indicate that, rather than the pipeline systems of Aharoni and Goldberg (2017) and Makarov et al. (2017), we should jointly train monotonic alignments with the transduction model, and, indeed, we set the single-model state of the art on the task of morphological inflection.

Hard Attention

Preliminary
We assume the source string $x \in \Sigma_x^*$ and the target string $y \in \Sigma_y^*$ are drawn from finite vocabularies $\Sigma_x = \{x_1, \ldots, x_{|\Sigma_x|}\}$ and $\Sigma_y = \{y_1, \ldots, y_{|\Sigma_y|}\}$, respectively. In tasks where the tag is provided, i.e., labeled transduction (Zhou and Neubig, 2017), we denote the tag as an ordered set $t \in \Sigma_t^*$ drawn from a finite tag vocabulary $\Sigma_t = \{t_1, \ldots, t_{|\Sigma_t|}\}$. We define the set $A = \{1, \ldots, |x|\}^{|y|}$ to be the set of all non-monotonic alignments from $x$ to $y$, where an alignment aligns each target character $y_i$ to exactly one source character in $x$.² In other words, it allows zero-to-one³ or many-to-one alignments between $x$ and $y$. For an $a \in A$, $A_i = a_i$ refers to the event that $y_i$ is aligned to $x_{a_i}$, which are the $i$th character of $y$ and the $a_i$th character of $x$, respectively. In general, we will shorten the expression $A_i = a_i$ to $a_i$ for brevity.
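To make the alignment set concrete, the following minimal sketch enumerates $A = \{1, \ldots, |x|\}^{|y|}$ for a toy pair of lengths and filters the monotonic members. The function names `all_alignments` and `is_monotonic` are illustrative and not taken from the released code.

```python
from itertools import product


def all_alignments(src_len, tgt_len):
    """Enumerate every alignment (a_1, ..., a_|y|) with 1-indexed source positions."""
    return list(product(range(1, src_len + 1), repeat=tgt_len))


def is_monotonic(alignment):
    """True if the aligned source position never moves backwards."""
    return all(prev <= cur for prev, cur in zip(alignment, alignment[1:]))


alignments = all_alignments(src_len=3, tgt_len=2)
monotonic = [a for a in alignments if is_monotonic(a)]
print(len(alignments), len(monotonic))
```

The unconstrained set grows as $|x|^{|y|}$, which is why the exact marginalization has to be rearranged rather than enumerated.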

0th-order Hard Attention
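Hard attention was first introduced to the literature by Xu et al. (2015). We, however, follow Wu et al. (2018) and use a tractable variant of hard attention and model the probability of a target string $y$ given an input string $x$ as follows:

$$
\begin{aligned}
p(y \mid x) &= \sum_{a \in A} p(y, a \mid x) \\
&= \underbrace{\sum_{a \in A} \prod_{i=1}^{|y|} p(y_i \mid a_i, y_{<i}, x)\, p(a_i \mid y_{<i}, x)}_{\text{exponential number of terms}} \\
&= \underbrace{\prod_{i=1}^{|y|} \sum_{a_i=1}^{|x|} p(y_i \mid a_i, y_{<i}, x)\, p(a_i \mid y_{<i}, x)}_{\text{polynomial number of terms}} \qquad (1)
\end{aligned}
$$

where we show how one can rearrange the terms to compute the function in polynomial time.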
The model above is exactly a 0th-order neuralized hidden Markov model (HMM). Specifically, $p(y_i \mid a_i, y_{<i}, x)$ can be regarded as an emission distribution and $p(a_i \mid y_{<i}, x)$ can be regarded as a transition distribution, which does not condition on the previous alignment. Hence, we will refer to this model as 0th-order hard attention. Because the alignment decisions are independent across target positions, the likelihood can be computed in $O(|x| \cdot |y| \cdot |\Sigma_y|)$ time.
² We write $A$ in the remainder with $x$ and $y$ implicit.
³ Zero in the sense of a non-character like BOS or EOS.
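The rearrangement from an exponential sum over alignments to a product of per-position sums can be checked numerically. The sketch below is illustrative only: `EMIT[i][j]` stands for $p(y_i \mid a_i = j, y_{<i}, x)$ and `ATTN[i][j]` for $p(a_i = j \mid y_{<i}, x)$, and the toy numbers are made up rather than produced by a trained model.

```python
from itertools import product
from math import isclose, prod

# Toy tables for |y| = 2 target characters and |x| = 3 source positions.
EMIT = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.4]]
ATTN = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]


def likelihood_brute_force(emit, attn):
    """Sum the joint probability over all |x|^|y| alignments."""
    tgt_len, src_len = len(emit), len(emit[0])
    total = 0.0
    for a in product(range(src_len), repeat=tgt_len):
        total += prod(emit[i][a[i]] * attn[i][a[i]] for i in range(tgt_len))
    return total


def likelihood_factorized(emit, attn):
    """Product over target positions of a sum over source positions."""
    return prod(
        sum(e * p for e, p in zip(emit_i, attn_i))
        for emit_i, attn_i in zip(emit, attn)
    )


print(isclose(likelihood_brute_force(EMIT, ATTN), likelihood_factorized(EMIT, ATTN)))
```

The factorized form touches each (target position, source position) pair once, whereas the brute-force form visits every alignment.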

1st-order Hard Attention
To enforce monotonicity, hard attention with conditionally independent alignment decisions is not enough: the model needs to know the previous alignment position when determining the current alignment position. Thus, we allow the transition distribution to condition on the previous alignment, $p(a_i \mid a_{i-1}, y_{<i}, x)$, and it becomes a 1st-order neuralized HMM. We display this model as a graphical model in Fig. 2. We will refer to it as 1st-order hard attention. Generalizing the 0th-order model, we define the 1st-order extension as follows:

$$
\begin{aligned}
p(y \mid x) &= \sum_{a \in A} p(y, a \mid x) \\
&= \underbrace{\sum_{a \in A} \prod_{i=1}^{|y|} p(y_i \mid a_i, y_{<i}, x)\, p(a_i \mid a_{i-1}, y_{<i}, x)}_{\text{exponential number of terms}} \\
&= \underbrace{\sum_{a_{|y|}=1}^{|x|} \alpha(a_{|y|})}_{\text{polynomial number of terms}} \qquad (2)
\end{aligned}
$$

where $\alpha(a_i)$ is the forward probability, calculated using the forward algorithm (Rabiner, 1989) with $\alpha(a_0) = 1$, and $p(a_1 \mid a_0)$ is the initial alignment distribution. For simplicity, we drop $y_{<i}$ and $x$ in $p(y_i \mid a_i)$ and $p(a_i \mid a_{i-1})$. For completeness, we include the recursive definition of the forward probability:

$$
\begin{aligned}
\alpha(a_i) &= p(y_i \mid a_i) \sum_{a_{i-1}=1}^{|x|} p(a_i \mid a_{i-1})\, \alpha(a_{i-1}) \\
\alpha(a_1) &= p(y_1 \mid a_1)\, p(a_1 \mid a_0)\, \alpha(a_0)
\end{aligned}
$$

Decoding at test time, however, is hard and we resort to a greedy scheme, described in Alg. 1. To see why it is hard, note that the dependence on $y_{<i}$ means that we have a neural language model scoring the target string as it is being transduced. The dependence is unbounded, so there is no dynamic program that allows for efficient computation.
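As an illustration of the forward recursion for a fixed target string, the sketch below computes the 1st-order likelihood and compares it against brute-force enumeration. The table layout is an assumption made for this example: `emit[i][j]` is $p(y_i \mid a_i = j)$, `init[j]` is $p(a_1 = j \mid a_0)$, and `trans[i - 1][k][j]` is $p(a_i = j \mid a_{i-1} = k)$; the numbers are toy values.

```python
from itertools import product


def forward_likelihood(emit, init, trans):
    """Exact p(y | x) via the forward algorithm; O(|x|^2) work per target position."""
    src_len = len(init)
    # Base case: alpha(a_1) = p(y_1 | a_1) p(a_1 | a_0) alpha(a_0), with alpha(a_0) = 1.
    alpha = [emit[0][j] * init[j] for j in range(src_len)]
    for i in range(1, len(emit)):
        alpha = [
            emit[i][j] * sum(trans[i - 1][k][j] * alpha[k] for k in range(src_len))
            for j in range(src_len)
        ]
    return sum(alpha)


def brute_force_likelihood(emit, init, trans):
    """Same quantity by summing over every alignment; exponential in |y|."""
    tgt_len, src_len = len(emit), len(init)
    total = 0.0
    for a in product(range(src_len), repeat=tgt_len):
        p = init[a[0]] * emit[0][a[0]]
        for i in range(1, tgt_len):
            p *= trans[i - 1][a[i - 1]][a[i]] * emit[i][a[i]]
        total += p
    return total


# Toy monotonic example: 2 source positions, 3 target characters.
EMIT = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.7]]
INIT = [0.8, 0.2]
TRANS = [[[0.6, 0.4], [0.0, 1.0]], [[0.5, 0.5], [0.0, 1.0]]]

print(forward_likelihood(EMIT, INIT, TRANS), brute_force_likelihood(EMIT, INIT, TRANS))
```

In the neural model, each emission entry additionally requires a softmax over $\Sigma_y$, which is where the $|\Sigma_y|$ factor in the stated complexity comes from.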
A Note on EOS. In the discussion above, we have suppressed the generation of EOS in the autoregressive models we derive for brevity. For example, $p(y_i \mid a_i, y_{<i}, x)$ must be a conditional distribution over $\Sigma_y \cup \{\text{EOS}\}$ in order for $p(y \mid x)$ to be a well-defined probability distribution.
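The role of EOS is easiest to see in greedy decoding, where generation stops once EOS is the best-scoring symbol. The sketch below is a hedged illustration of a greedy scheme in the spirit of Alg. 1, not the released implementation: `emit_fn` and `trans_fn` are hypothetical callbacks standing in for the neural model, and the toy tables ignore the prefix they receive.

```python
EOS = "<eos>"


def greedy_decode(init, emit_fn, trans_fn, max_len):
    """Pick the best symbol at each step after marginalizing over the alignment."""
    n = len(init)
    prefix, alpha = [], None
    for _ in range(max_len):
        if alpha is None:
            prior = init  # first step: p(a_1 | a_0) with alpha(a_0) = 1
        else:
            trans = trans_fn(prefix)  # trans[k][j] = p(a_i = j | a_{i-1} = k, prefix)
            prior = [sum(trans[k][j] * alpha[k] for k in range(n)) for j in range(n)]
        emit = emit_fn(prefix)  # emit[c][j] = p(c | a_i = j, prefix), c includes EOS
        best = max(emit, key=lambda c: sum(emit[c][j] * prior[j] for j in range(n)))
        # Forward probability updated with the symbol that was actually chosen.
        alpha = [emit[best][j] * prior[j] for j in range(n)]
        if best == EOS:
            break
        prefix.append(best)
    return prefix


def toy_emit(prefix):
    # Source positions: "a", "b", and an end marker; each column sums to 1.
    return {"a": [0.8, 0.1, 0.1], "b": [0.1, 0.8, 0.1], EOS: [0.1, 0.1, 0.8]}


def toy_trans(prefix):
    # Monotonic: stay with probability 0.1 or advance by one with probability 0.9.
    return [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]]


INIT = [1.0, 0.0, 0.0]
print(greedy_decode(INIT, toy_emit, toy_trans, max_len=10))
```

A practical implementation would likely work in log space or renormalize `alpha` to avoid underflow on long strings; that detail is omitted here.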

A Neural Parameterization with Enforced Monotonicity
The goal of this section is to take the 1st-order model of §2 and show how we can straightforwardly enforce the monotonicity of the alignments. We will achieve this by adding structural zeros to the distribution, which will still allow us to perform efficient inference with dynamic programming. We follow the neural parameterization of Wu et al. (2018). The source string $x$ is represented by a sequence of character embedding vectors, which are fed into an encoder bidirectional LSTM (Hochreiter and Schmidhuber, 1997) to produce hidden state representations $h^e_j$. The emission distribution $p(y_i \mid a_i, y_{<i}, x)$ depends on these encodings $h^e_j$ and the decoder hidden states $h^d_i$, produced by

$$ h^d_i = \mathrm{LSTM}\big([e^d(y_{i-1}); h^t],\, h^d_{i-1}\big) $$

where $e^d$ encodes target characters into character embeddings. The tag embedding $h^t$ is produced by

$$ h^t = \mathrm{ReLU}\big(Y\,[e^t(t_1); \ldots; e^t(t_{|\Sigma_t|})]\big) $$

where $e^t$ maps the tag $t_k$ into a tag embedding $h^t_k \in \mathbb{R}^{d_t}$ or the zero vector $0 \in \mathbb{R}^{d_t}$, depending on whether the tag $t_k$ is present. Note that $Y \in \mathbb{R}^{d_t \times |\Sigma_t| d_t}$ is a learned parameter matrix. Also, $h^e_j \in \mathbb{R}^{2d_h}$, $h^d_i \in \mathbb{R}^{d_h}$, and $h^t \in \mathbb{R}^{d_t}$ are hidden states.

The Emission Distribution. All of our hard-attention models employ the same emission distribution parameterization, which we define below:

$$ p(y_i \mid a_i, y_{<i}, x) = \mathrm{softmax}\big(W f(h^d_i, h^e_{a_i})\big), \qquad f(h^d_i, h^e_{a_i}) = \tanh\big(V\,[h^d_i; h^e_{a_i}]\big) $$

where $V \in \mathbb{R}^{3d_h \times 3d_h}$ and $W \in \mathbb{R}^{|\Sigma_y| \times 3d_h}$ are learned parameters.
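0th-order Hard Attention. In the case of the 0th-order model, the transition distribution is computed by a bilinear attention function and used with Eq. (1):

$$ p(A_i = j \mid y_{<i}, x) = \frac{\exp(e_{ij})}{\sum_{j'=1}^{|x|} \exp(e_{ij'})}, \qquad e_{ij} = {h^d_i}^{\top}\, T\, h^e_j $$

where $T \in \mathbb{R}^{d_h \times 2d_h}$ is a learned parameter and $A_i$ is a random variable ranging over the values of the $i$th alignment.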
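A minimal sketch of bilinear attention, assuming the score takes the form $e_{ij} = {h^d_i}^{\top} T h^e_j$ (which matches the stated shape of $T$) followed by a softmax over source positions. The vectors below are toy values with $d_h = 1$, and the helper names are illustrative.

```python
from math import exp


def bilinear_scores(h_dec, T, h_enc):
    """Score every source position j against one decoder state: h_dec^T T h_enc[j]."""
    scores = []
    for h_j in h_enc:
        projected = [sum(t * h for t, h in zip(row, h_j)) for row in T]  # T h_j
        scores.append(sum(d * v for d, v in zip(h_dec, projected)))
    return scores


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


H_DEC = [1.0]                       # decoder state, dimension d_h = 1
T = [[1.0, 0.0]]                    # shape d_h x 2 d_h
H_ENC = [[0.0, 5.0], [2.0, 0.0]]    # two encoder states, dimension 2 d_h

scores = bilinear_scores(H_DEC, T, H_ENC)
probs = softmax(scores)
print(scores, probs)
```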
0th-order Hard Monotonic Attention. We may enforce string monotonicity by zeroing out any non-monotonic alignment without adding any additional parameters, which can be done by adding structural zeros to the distribution as follows:

$$ p(A_i = j \mid A_{i-1} = j', y_{<i}, x) = \mathbb{1}\{j \ge j'\}\, \frac{\exp(e_{ij})}{\sum_{j''=j'}^{|x|} \exp(e_{ij''})} $$

where $e_{ij}$ is the bilinear attention score of the 0th-order model. These structural zeros prevent the alignments from jumping backwards during transduction and, thus, enforce monotonicity. The parameterization is identical to the 0th-order model up to the enforcement of the hard constraint with Eq. (2).
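The structural zeros amount to masking every source position before the previous alignment and renormalizing over what remains. The sketch below illustrates this with 0-indexed positions; `monotonic_attention` is a hypothetical helper name and the scores are toy values.

```python
from math import exp


def monotonic_attention(scores, prev_pos):
    """p(a_i = j | a_{i-1} = prev_pos): zero for j < prev_pos, softmax elsewhere."""
    m = max(scores[prev_pos:])  # stabilize using only the allowed positions
    exps = [exp(s - m) if j >= prev_pos else 0.0 for j, s in enumerate(scores)]
    z = sum(exps)
    return [e / z for e in exps]


probs = monotonic_attention([1.0, 2.0, 3.0], prev_pos=1)
print(probs)
```

No parameters are added: the same scores are reused, and only the support of the distribution changes.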
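Algorithm 1 Greedy decoding. ($N$ is the maximum length of the target string.)
for $i = 1, \ldots, N$ do
    if $i = 1$ then
        $y^*_1 = \arg\max_{y_1} \sum_{a_1} p(y_1 \mid a_1)\, p(a_1 \mid a_0)$
        $\alpha(a_1) = p(y^*_1 \mid a_1)\, p(a_1 \mid a_0)\, \alpha(a_0)$
    else
        $y^*_i = \arg\max_{y_i} \sum_{a_i} p(y_i \mid a_i) \sum_{a_{i-1}} p(a_i \mid a_{i-1})\, \alpha(a_{i-1})$
        $\alpha(a_i) = p(y^*_i \mid a_i) \sum_{a_{i-1}} p(a_i \mid a_{i-1})\, \alpha(a_{i-1})$
    if $y^*_i = \text{EOS}$ then
        return $y^*$

1st-order Hard Monotonic Attention. We may also generalize the 0th-order case by adding more parameters. This will equip the model with a more expressive transition function. In this case, we take the 1st-order hard attention to be an offset-based transition distribution similar to Wang et al. (2018):

$$ p(a_i \mid a_{i-1}, y_{<i}, x) = \begin{cases} \mathrm{softmax}\big(U\,[h^d_i; T h^e_{a_{i-1}}]\big)_{\Delta} & 0 \le \Delta \le w \\ 0 & \text{otherwise} \end{cases} $$

where $\Delta = a_i - a_{i-1}$ is the relative distance to the previous attention position, $U \in \mathbb{R}^{(w+1) \times 2d_h}$ is a learned parameter, and $w \in \mathbb{N}$ is an integer hyperparameter. Note that, as before, we also enforce monotonicity as a hard constraint in this parameterization.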
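An offset-based transition can be sketched as a softmax over $w + 1$ forward offsets that is scattered back onto source positions. In the sketch below, `logits` stands in for the $w + 1$ scores produced by the learned map, positions are 0-indexed, and offsets that would run past the end of the source are dropped and the rest renormalized. That boundary handling is an assumption of this example, since the text does not specify it.

```python
from math import exp


def offset_transition(logits, prev_pos, src_len):
    """p(a_i = j | a_{i-1} = prev_pos) with support prev_pos .. prev_pos + w."""
    w = len(logits) - 1
    # Keep only offsets that land inside the source string (assumed behaviour).
    valid = [d for d in range(w + 1) if prev_pos + d < src_len]
    m = max(logits[d] for d in valid)
    exps = {d: exp(logits[d] - m) for d in valid}
    z = sum(exps.values())
    probs = [0.0] * src_len
    for d in valid:
        probs[prev_pos + d] = exps[d] / z
    return probs


print(offset_transition([0.0, 1.0, 2.0], prev_pos=1, src_len=5))
```

Because the offsets are non-negative, monotonicity holds by construction, and the window size $w$ bounds how far the alignment can jump in one step.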

Related Work
There have been previous attempts to look at monotonicity in neural transduction. Graves (2012) first introduced the monotonic neural transducer for speech recognition. Building on this, Yu et al. (2016) propose using a separated shift/emit transition distribution to allow a more expressive model. Like us, they also consider morphological inflection and outperform a (weaker) soft attention baseline. Rastogi et al. (2016) offer a neural parameterization of a finite-state transducer, which implicitly encodes monotonic alignments. Instead of learning the alignments directly, Aharoni and Goldberg (2017) take the monotonic alignments from an external model (Sudoh et al., 2013) and train the neural model with these alignments. In follow-up work, Makarov et al. (2017) show this two-stage approach to be effective, winning the CoNLL-SIGMORPHON 2017 shared task on morphological inflection (Cotterell et al., 2017). Raffel et al. (2017) propose a stochastic monotonic transition process to allow sample-based online decoding.

Experimental Findings
Finding #1: Morphological Inflection. The first empirical finding in our study is that we achieve single-model, state-of-the-art performance on the CoNLL-SIGMORPHON 2017 shared task dataset.
The results are shown in Tab. 2. We find that the 1-MONO system ties with the 0-MONO system, indicating that the additional parameters do not add much. Both of these monotonic systems surpass the non-monotonic systems 0-HARD and SOFT. We also compare to other top systems at the task in Tab. 1.
The previous state-of-the-art model, Bergmanis et al. (2017), is a non-monotonic system that outperformed the monotonic system of Makarov et al. (2017). However, Makarov et al. (2017) is a pipeline system that took alignments from an existing aligner; such a system has no way to recover from a poor initial alignment. We show that jointly learning monotonic alignments leads to improved results.
The second finding is that, comparing SOFT, 0-HARD, and 0-MONO in Tab. 2, 0-MONO outperforms 0-HARD, and 0-HARD in turn outperforms SOFT, in all three tasks. This shows that monotonicity should be enforced strictly, since strict monotonicity does not hurt the model. We contrast this with the findings of Wu et al. (2018), who found that non-monotonic models outperform monotonic ones; this suggests strict monotonicity is more helpful when the model is allowed to learn the alignment distribution jointly.
Finding #3: Do Additional Parameters Help? The third finding is that 1-MONO has a more expressive transition distribution and, thus, outperforms 0-MONO and 0-HARD in G2P. However, it performs as well as or worse on the other tasks. This tells us that the additional parameters are not always necessary for improved performance. Rather, it is the hard constraint that matters, not the more expressive distribution. However, we remark that enforcing the monotonic constraint does come at an additional computational cost.

Conclusion
We expand the hard-attention neural sequence-to-sequence model of Wu et al. (2018) to enforce monotonicity. We show, empirically, that enforcing monotonicity in the alignments found by hard attention models helps significantly, and we achieve state-of-the-art performance on morphological inflection using data from the CoNLL-SIGMORPHON 2017 shared task. We isolate the effect of monotonicity in a controlled experiment and show monotonicity is a useful hard constraint for three tasks, and speculate previous underperformance is due to a lack of joint training.
⁴ Some numbers were obtained by contacting authors.

Figure 1: Example of source and target string for each task. The tag guides transduction in morphological inflection.
Figure 2: Our monotonic hard-attention model viewed as a graphical model. The circular nodes are random variables and the diamond nodes are deterministic variables. We have omitted arcs from $x$ to $y_1$, $y_2$, $y_3$ and $y_4$ for clarity (to avoid crossing arcs).
Thus, computation of the likelihood in our 1st-order hard attention model is $O(|x|^2 \cdot |y| \cdot |\Sigma_y|)$ by the dynamic program given in the paper.

Table 1: Average dev performance on morphological inflection of our models against single models from the 2017 shared task. All systems are single model, i.e., without ensembling. Why dev? No participants submitted single-model systems for evaluation on test and the best systems were not open-sourced, constraining our comparison. Note we report numbers from their paper.⁴