On Biasing Transformer Attention Towards Monotonicity

Many sequence-to-sequence tasks in natural language processing are roughly monotonic in the alignment between source and target sequence, and previous work has facilitated or enforced learning of monotonic attention behavior via specialized attention functions or pretraining. In this work, we introduce a monotonicity loss function that is compatible with standard attention mechanisms and test it on several sequence-to-sequence tasks: grapheme-to-phoneme conversion, morphological inflection, transliteration, and dialect normalization. Experiments show that we can achieve largely monotonic behavior. Performance is mixed, with larger gains on top of RNN baselines. General monotonicity does not benefit transformer multihead attention; however, we see isolated improvements when only a subset of heads is biased towards monotonic behavior.


Introduction
Many sequence-to-sequence tasks in natural language processing are roughly monotonic in the alignment between source and target sequence, and previous work has focused on learning monotonic attention behavior either through specialized attention functions (Aharoni and Goldberg, 2017; Raffel et al., 2017; Wu and Cotterell, 2019) or pretraining (Aji et al., 2020). However, it is non-trivial to port specialized attention functions to different models, and recently, Yolchuyeva et al. (2019) and Wu et al. (2021) found that a transformer model (Vaswani et al., 2017) outperforms previous work on monotone tasks such as grapheme-to-phoneme conversion, despite having no mechanism that biases the model towards monotonicity.
In the transformer, it is less clear to what extent individual encoder states, especially in deeper layers, still represent distinct source inputs after passing through several self-attention layers. Consequently, it is unclear whether enforcing monotonicity in the transformer is as beneficial as it is for recurrent neural networks (RNNs).
In this paper, we investigate the following research questions:
1. How can we incorporate a monotonicity bias into attentional sequence-to-sequence models such as the transformer?
2. To what extent does a transformer model benefit from such a bias?
Specifically, we want to incorporate a monotonicity bias in a way that is agnostic of the task and model architecture, allowing for its application to different sequence-to-sequence models and tasks. To this end, we introduce a loss function that measures and rewards monotonic behavior of the attention mechanism. We perform experiments and analysis on a variety of sequence-to-sequence tasks where we expect the alignment between source and target to be highly monotonic, such as grapheme-to-phoneme conversion, transliteration, morphological inflection, and dialect normalization, and compare our results to previous work that successfully applied hard monotonic attention to recurrent sequence-to-sequence models for these tasks (Wu et al., 2018a; Wu and Cotterell, 2019).
Our results show that a monotonicity bias learned through a loss function is capable of making the soft attention between source and target highly monotonic both in RNNs and the transformer. We find that this leads to improvements similar to previous work on hard monotonic attention for RNNs, whereas for transformer models, the results are mixed: biasing all attention heads towards monotonicity may limit the representation power of multihead attention in a way that is harmful even for monotonic sequence-to-sequence tasks. However, for some tasks, we see small improvements when limiting monotonicity to only a subset of heads.

Related Work
Attention models (Bahdanau et al., 2015;Luong et al., 2015;Vaswani et al., 2017) are a very powerful and flexible mechanism to learn the relationship between source and target sequences, but the flexibility might come at the cost of making the relationship harder to learn. Previous work has shown that their performance can be improved by introducing inductive biases. Cohn et al. (2016) introduce various structural alignment biases into a neural machine translation model, including a positional bias. While this bias is motivated by the fact that a given token in the source often aligns with a target token at a similar relative position, it does not explicitly encourage monotonicity.
In contrast, Raffel et al. (2017) propose to modify the attention mechanism to learn hard monotonic alignments instead of computing soft attention over the whole source sequence. Several extensions have been proposed: having a pointer monotonically move over the source sequence and computing soft attention on a local window (Chiu and Raffel, 2018) or from the beginning of the sequence up to the pointer (Arivazhagan et al., 2019). For tasks like simultaneous translation and automatic speech recognition, the main benefit of hard monotonic attention is that decoding becomes faster and can be done in an online setting. However, many sequence-to-sequence tasks behave roughly monotonically, and biasing the attention towards monotonicity can improve performance, especially in low-resource settings. Aharoni and Goldberg (2017) show that hard monotonic attention works well for morphological inflection if it mimics an external alignment. Wu et al. (2018b) propose a probabilistic latent-variable model for hard but non-monotonic attention, which Wu and Cotterell (2019) later extend to exact hard monotonic attention. In contrast to Aharoni and Goldberg (2017), the alignment is learned jointly with the model. Their approach outperforms several other models on grapheme-to-phoneme conversion, transliteration, and morphological inflection. Monotonic attention has also improved tasks such as summarization (Chung et al., 2020) and morphological analysis (Hwang and Lee, 2020).
Recently, the transformer architecture (Vaswani et al., 2017) has outperformed RNNs in low-resource settings for character-level transduction tasks (Yolchuyeva et al., 2019; Wu et al., 2021) and neural machine translation (Araabi and Monz, 2020). While there has been some work on extending the methods of Raffel et al. (2017); Chiu and Raffel (2018); Arivazhagan et al. (2019) to multihead attention, we are not aware of any work that studied monotonicity in transformers for monotonic tasks, such as grapheme-to-phoneme conversion, transliteration, or morphological inflection.
To this end, we propose a model-agnostic monotonicity loss that can seamlessly be integrated into RNNs as well as the transformer. Our monotonicity loss captures how monotone the soft attention behaves during training, while two hyperparameters allow us to control how much monotonicity is enforced. By encouraging monotonicity through a loss instead of a modification of the attention mechanism, our implementation still brings all the benefits of soft attention to tasks where fast, online inference is not paramount and allows us to explore various trade-offs between unconstrained and fully monotonic attention.

Monotonicity Loss
We now introduce our monotonicity loss function. The loss function is differentiable and compatible with standard soft attention mechanisms and is thus easy to integrate into popular encoder-decoder architectures such as the transformer. On a high level, we compare the attention distribution between decoder time steps in a pairwise fashion and measure whether the mean attended position increases for each pair.
Let us denote the input sequence as X = (x_1, ..., x_|X|) and the output sequence as Y = (y_1, ..., y_|Y|). The interface between the encoder and decoder is one or several attention mechanisms. In its general form, the attention mechanism computes some energy e_ij between a decoder state at time step i and an encoder state j. While this energy function varies, with popular choices being a feedforward network (Bahdanau et al., 2015) or (scaled) dot-product (Luong et al., 2015; Vaswani et al., 2017), the energies are typically normalized to a vector of attention weights α using the softmax function:

α_ij = exp(e_ij) / Σ_j' exp(e_ij')    (1)

These attention weights are then applied to obtain a weighted average c_i of a vector of value states V:

c_i = Σ_j α_ij v_j    (2)

For our monotonicity loss, we also compute the mean attended position ā_i:

ā_i = Σ_j α_ij · j    (3)

We can then define the monotonicity loss in a pairwise fashion, comparing the mean attended position at time steps i and i+1:

L_i = max(0, ā_i + δ · |X|/|Y| − ā_(i+1)) / |X|    (4)

δ is a hyperparameter that controls how deviations from the main diagonal are penalized. Let us first consider the case with δ = 0: if ā_(i+1) ≥ ā_i for all positions i, i.e. if the mean attended position is weakly increasing, then the loss is 0. Any decrease in the mean attended position incurs a cost proportional to the amount of decrease, relative to the source sequence length; this keeps the loss differentiable, and it will also serve as a measure of the degree of monotonicity in the analysis.
We might want to bias the model towards strictly monotonic behavior, penalizing it if ā remains unchanged over several time steps. We can achieve this by incurring a loss if ā does not increase by some margin, controlled by δ. At the most extreme, with δ = 1, the loss is minimized if the mean attended position follows the main diagonal of the alignment matrix, increasing by |X|/|Y| at each time step. Figure 1 shows how the margin δ can influence the monotonicity loss with some examples.
In Equation 4, the per-pair costs are summed over the target sequence. In practice, we normalize the cost by the number of tokens in a batch for training stability, as is typically done for the cross-entropy loss. If a model has multiple attention mechanisms, e.g. attention in multiple layers or multihead attention, we compute the loss for each attention mechanism separately and then average the losses. We can also apply the loss to only a subset of attention mechanisms, allowing different attention heads to learn specialized behavior (Voita et al., 2019).
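As a concrete illustration, the pairwise loss described above can be sketched in a few lines of Python. This is a minimal sketch with our own function and variable names; a real implementation would operate on batched tensors and keep the computation differentiable (e.g. in PyTorch):

```python
def monotonicity_loss(attn, delta=0.0):
    """Monotonicity loss for a single attention matrix.

    attn: list of |Y| rows, each a softmax distribution over the |X|
          source positions (row i = attention weights at decoder step i).
    delta: margin hyperparameter (0 = weakly increasing is enough,
           1 = the mean attended position should follow the diagonal).
    """
    n_tgt = len(attn)
    n_src = len(attn[0])
    # mean attended position at each decoder step (0-indexed positions)
    a_bar = [sum(w * j for j, w in enumerate(row)) for row in attn]
    margin = delta * n_src / n_tgt
    loss = 0.0
    for i in range(n_tgt - 1):
        # penalize whenever a_bar fails to increase by the margin,
        # normalized by the source length
        loss += max(0.0, a_bar[i] + margin - a_bar[i + 1]) / n_src
    return loss
```

With δ = 0, perfectly diagonal or merely non-decreasing attention incurs zero loss, while any backward jump of the mean attended position is penalized in proportion to its size.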

Models and Data
We implement the loss function in sockeye (Hieber et al., 2018), and experiment with RNN and transformer models. We list the specific baseline settings for each task in Appendix A.2.
The monotonicity loss is controlled by a hyperparameter for the margin (δ) and an additional scaling factor for the loss itself (λ). Preliminary experiments have shown that the monotonicity loss interacts undesirably with attention dropout, which is commonly used in transformer models: randomly dropping attention connections during training makes it harder to reliably avoid a decrease in the mean attended position, favoring a degenerate local optimum where attention rests constantly on the first (or last) encoder state. To avoid this problem, we use DropHead (Zhou et al., 2020) instead, which has a similar regularizing effect as attention dropout but does not interact with the monotonicity loss. In addition to the standard evaluation metrics used in each task, we provide the monotonicity loss on the test set and the percentage of target tokens for which the average source attention position has increased (by some margin).
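The idea behind DropHead can be sketched as follows. This is a hedged illustration, not the exact implementation of Zhou et al. (2020): entire heads are dropped at random during training instead of individual attention weights, so the surviving heads still produce intact (and, under the monotonicity loss, monotone) attention distributions:

```python
import random

def drop_head(head_outputs, p=0.2, training=True):
    """DropHead-style regularization sketch: zero out entire attention
    heads with probability p during training and rescale the survivors.
    head_outputs: list of per-head output vectors (plain lists here)."""
    if not training or p == 0.0:
        return head_outputs
    keep = [random.random() >= p for _ in head_outputs]
    if not any(keep):
        # always keep at least one head
        keep[random.randrange(len(keep))] = True
    scale = len(keep) / sum(keep)  # compensate for the dropped mass
    return [[x * scale for x in h] if k else [0.0] * len(h)
            for h, k in zip(head_outputs, keep)]
```

At inference time (`training=False`), all heads pass through unchanged, mirroring standard dropout behavior.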
We perform experiments on three word-level and one sentence-level sequence-to-sequence tasks:

Grapheme-to-Phoneme Conversion
For grapheme-to-phoneme conversion, we use NETtalk (Sejnowski and Rosenberg, 1987) and CMUdict, two datasets for English, with the same data split as Wu and Cotterell (2019). For experiments with RNN models, we follow the settings in Wu et al. (2018b) (large configuration). For experiments with transformer models, we follow the settings suggested in Wu et al. (2021); however, we use dropout rates of 0.3 (NETtalk) and 0.2 (CMUdict) instead of 0.1 and 0.3. Furthermore, we use a smaller feed-forward dimension for the NETtalk models (512 instead of 1024), since this is a relatively small dataset (∼14k samples).
For both RNN and transformer models, we use early stopping with phoneme error rate, as opposed to a minimum learning rate value as in Wu et al. (2018b) and Wu et al. (2021). We evaluate our models with word error rate (WER) and phoneme error rate (PER).

Morphological Inflection
For morphological inflection, we use the CoNLL-SIGMORPHON 2017 shared task dataset. We choose all 51 languages from the high-resource setting, where the training data for each language consists of 10,000 pairs of morphological tags + lemma and inflected form (except for Bengali and Haida, which have 4,243 and 6,840 pairs respectively), and from the medium-resource setting with 1,000 training examples per language. Our baselines performed very poorly on the low-resource setting with only 100 training examples, and we decided to focus on the other two settings instead. We preprocess the data to insert a separator token between the morphological tags and the input lemma. The monotonicity loss is then only computed on the positions to the right of the separator token's position. We follow Wu et al. (2021) and use special positional encodings for the morphological tags in the transformer. Unlike their approach, where the position for all tags was set to 0, we set the position of the separator token to 0 and sequentially decrease the positions of the morphological tags to the left (Figure 2). This serves to stabilize the positional encodings of the lemma tokens, while still accounting for the fixed order of morphological tags in the dataset. In preliminary experiments, we observed an improvement of 0.63% in accuracy over vanilla positional encodings.
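The position assignment can be sketched as follows (an illustrative sketch; the `<sep>` token and the function name are our own, and the actual mapping of these indices onto positional embeddings is an implementation detail we leave open):

```python
def tag_positions(tokens, sep="<sep>"):
    """Assign position indices: the separator token gets position 0,
    morphological tags to its left get sequentially decreasing positions,
    and the lemma characters to the right get 1, 2, ... so that lemma
    positions are stable regardless of the number of tags."""
    s = tokens.index(sep)  # assumes exactly one separator is present
    return [i - s for i in range(len(tokens))]
```

For example, the input `V PST <sep> g o` receives positions −2, −1, 0, 1, 2, so the lemma always starts at position 1.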
We train character-level models for morphological inflection, following the previously recommended settings for RNNs in Wu et al. (2018b) and for transformers in Wu et al. (2021) (except for reducing the feed-forward dimension to 512 instead of 1024). For the high-resource datasets, we use a batch size of 400, and for the medium-resource datasets 200. Early stopping is done in the same way as for grapheme-to-phoneme conversion. We use the official evaluation script to compute word-level accuracy (ACC) and character-level edit distance (LEV).

Transliteration
For transliteration, we experiment on the NEWS2015 shared task data (Zhang et al., 2015) and use the same subset of 11 script pairs as Wu and Cotterell (2019). Some source names have multiple acceptable transliterations; we add all possible pairs to our training data, which only has a large effect on EN-AR, where there are on average 10 acceptable transliterations per source name. Since the references of the official shared task test sets were not released, we follow Wu and Cotterell (2019) and use the development set as our test set. We randomly sample 1,000 names from the training sets as our development sets for script pairs with more than 20,000 training examples and 100 for script pairs with fewer training examples.
Again, we follow Wu et al. (2018b) for hyperparameters in RNNs and Wu et al. (2021) in transformers (smaller feed-forward dimensions of 512). We early stop training as for grapheme-to-phoneme conversion. We evaluate our models following Zhang et al. (2015) and compute word-level accuracy (ACC) and character-level mean F-score (MFS). The formula for MFS is in Appendix A.1.

Dialect Normalization
For this work, we consider dialect normalization as a machine translation task from dialect to standard language. We work with the dataset described in Aepli and Clematide (2018), which consists of 26,015 crowd-sourced German translations of 6,197 original Swiss German sentences. We use three documents (10%) as test sets and randomly split the rest into development and training sets (10% and 80%, respectively). The alignment between Swiss German and the German translations is highly monotonic, but there are occasional word order differences, as illustrated in Figure 3.

Figure 3: Swiss German to German dialect normalization example with verb reordering: "es isch aber als Kompliment gmeint gsi" → "es war aber als Kompliment gemeint" (gloss: "it was however as compliment meant").
The models are trained on subwords obtained via BPE (Sennrich et al., 2016), created with subword-nmt computing 2000 merges. We treat this as a low-resource machine translation task, and thus follow the hyperparameters of Sennrich and Zhang (2019) for the RNN models, while the transformer models are trained according to Araabi and Monz (2020). We evaluate our models with BLEU (Papineni et al., 2002).

Results
In addition to task-specific evaluation metrics, we use the loss function to score the monotonicity of the attention on the test set for all models (reported as L_MONO). Furthermore, we report the percentage of decoding states for which the average source attention position ā increases by at least δ · |X|/|Y| as %mono. In other words, this is the percentage of states for which the pairwise loss is 0.
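The %mono statistic for a single attention matrix can be computed as sketched below (function and variable names are our own; in practice the counts are aggregated over the whole test set and over all attention mechanisms):

```python
def percent_mono(attn, delta=0.0):
    """Percentage of decoder steps whose mean attended position increases
    by at least delta * |X| / |Y| over the previous step, i.e. the steps
    with zero pairwise monotonicity loss."""
    if len(attn) < 2:
        return 100.0  # a single step is trivially monotone
    n_tgt, n_src = len(attn), len(attn[0])
    a_bar = [sum(w * j for j, w in enumerate(row)) for row in attn]
    margin = delta * n_src / n_tgt
    ok = sum(1 for i in range(n_tgt - 1) if a_bar[i + 1] >= a_bar[i] + margin)
    return 100.0 * ok / (n_tgt - 1)
```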

Grapheme-to-Phoneme Conversion
We test different settings on the grapheme-to-phoneme task; see Table 1 for results with RNNs (top) and transformers (bottom). We find that models trained with the additional loss have more monotonic attention than the baselines (see %mono and L_MONO). We observe large differences both in terms of WER and PER across multiple runs for the baseline, especially for the smaller dataset. We therefore report the average result of three runs with standard deviations for each model.
Attention in the RNN baselines is already quite monotonic, but we observe small improvements with δ = 0.5. For transformer models, on the other hand, δ > 0 seems to harm the performance, therefore we only report results with δ = 0. In general, multihead attention in the transformer does not seem to benefit much from enforced monotonicity.

Morphological Inflection
For morphological inflection, we show the average results over all 51 languages in Table 2. Our RNN baseline is slightly better than previous work, whereas our transformer baseline performs slightly worse. We notice that the transformer models trained with δ = 0 on the morphological inflection tasks result in the model always attending to the same source position at every decoding state. We therefore set δ to 0.1 for transformer models trained on this task. For the remaining tasks, we report results with δ set to 0 and λ always set to 0.1 so as not to overfit hyperparameters on each task. The baseline monotonicity loss for this task is higher than for grapheme-to-phoneme conversion but training with the monotonicity loss can drastically increase the monotonicity of the attention mechanisms. This can be seen both in the lower monotonicity score and the higher percentage of decoding states where the average source attention position increases from the previous state. In terms of performance, we do not see an improvement over the baselines.

Transliteration
Our results for transliteration are shown in Table 3 (average over all 11 datasets). Again, we can see that the monotonicity loss effectively biases the attention towards more monotonic behavior, decreasing the monotonicity score and increasing the percentage of decoding states where the average source attention position increases. In terms of performance, there is a small gain for RNNs both in word-level accuracy and character-level mean F-score. Training with the monotonicity loss does not improve the performance of the transformer compared to the baseline.

Dialect Normalization
Since dialect normalization is our only sentence-level sequence-to-sequence task, it is interesting to see how the monotonicity loss works on longer sequences where more reordering is possible compared to the previous tasks. The less monotonic nature of this task is reflected in the fact that neither of our models trained towards monotonicity outperforms the non-monotonic baselines, see Table 3. Dialect normalization is also the only task where the transformer does not outperform the RNN models.

Analysis
Overall, our results show that the proposed monotonicity loss succeeds in making attention more monotonic, but effects on quality are more positive for RNNs than for transformers. We now analyze the proposed loss function in more detail.

Monotonicity Over Time
First, we plot the monotonicity score during training and compare how fast it decreases over time.
We find that the monotonicity score decreases very fast for the models trained with our loss function and then stays rather constant. The baseline models show various behaviors: for some datasets and models, the score decreases over training time, suggesting that the model does learn to attend more monotonically even without the loss. For other datasets, the score is initially lower and increases over training time, and, for some, the score stays more or less constant. What all baselines have in common is that the monotonicity score oscillates much more than when trained with the monotonicity loss. Figure 4 shows an example plot for the EN-JA transliteration dataset.

Varying Monotonicity
We can vary how much we constrain attention to be monotonic by varying the weight of the monotonicity loss function (λ). We analyze how this influences the performance on dialect normalization. Figure 5 shows that non-monotonic behavior (as defined by the monotonicity loss) can be reduced by a factor of 10-20 with stable or even slightly improving performance. However, BLEU drops drastically for large λ. This highlights the advantage of our loss function over hard monotonic attention. Through λ we can regulate the degree of monotonicity in the attention mechanism, which can be beneficial for tasks where hard monotonic attention would be too strict.

Monotonicity Loss on Single Heads
Since we calculate the loss on each attention component separately, we can also limit its application to specific layers and heads (in the case of multihead attention). We test how restricting the monotonic behavior to only one head per layer influences the performance of the transformer on our chosen tasks. Results are presented in Table 4. We find that applying monotonicity to only one head generally improves performance compared to applying it to all heads, except for dialect normalization. For grapheme-to-phoneme conversion and morphological inflection in the medium-resource setting, we even see performance gains over the baseline. Our results support the belief that the flexibility of multihead attention is key to the success of the transformer. If applied to all heads, the monotonicity loss reduces variability in the attention distributions of the different heads, i.e. with high λ, all heads attend to the same source position. We suspect that this severely limits the capacity of transformer models and explains why rewarding monotonicity on only one head is beneficial.
These findings are also important in the context of the work by Voita et al. (2019) who find that attention heads tend to learn specialized functions.
Having one monotonic attention head could be a complementary way to encourage more diversity amongst heads, next to disagreement regularization (Li et al., 2018). Indeed, we observe that for grapheme-to-phoneme conversion and dialect normalization the remaining heads trained without the monotonicity loss tend to become less monotonic.

Attention Maps
Attention maps are particularly interesting for dialect normalization where 1) the transformer baseline has one of the highest monotonicity losses of all our models and 2) reordering of source and target tokens is possible. Figure 6 shows the attention maps for our baseline transformer and the corresponding model trained with the monotonicity loss. The bottom sentence is an example where the alignment between the source and the target is monotonic. Here, the baseline does show tentative monotonic behavior but with the monotonicity loss, the attention follows the main diagonal much more closely. The sentence on the top, on the other hand, contains a non-monotonic alignment. For a correct alignment of the past tense of "to be", the model needs to peek at the very last token before the full stop. This is reflected in the baseline attention map where the attention at the second decoding step is highest on the third-to-last source position. However, for our model trained with the monotonicity loss, the attention follows the main diagonal and fails to mirror the correct alignment. Occasional reorderings like this may explain why the monotonicity loss did not work well for this task despite it being largely monotonic.

Conclusion
We propose a model-agnostic loss function that measures and rewards monotonicity and can easily be integrated into various attention mechanisms. To achieve this, we track how monotonically the average position of the attention shifts over the source sequence across time steps. We show that this loss function can be seamlessly integrated into RNNs as well as transformers. Models trained with our monotonicity loss learn largely monotonic behavior without any specific changes to the attention mechanism. While we see some performance gains in RNNs, our results show that biasing all attention heads in transformers towards monotonic behavior is undesirable. However, a bias towards monotonicity may be helpful if applied to only a subset of heads.

Table 4: Transformer results for all tasks with monotonicity on all heads vs. only on one head. Monotonicity loss is computed on all layers. Average over three runs with independent seeds. Our best models are marked in bold.

Figure 6: Transformer attention maps for the sentences shown in Figure 3; "but it was meant as compliment" and "we delete this and get to work". Left: baseline (λ=0), right: with monotonicity loss on all heads (λ=0.1).
For the future, we are interested in more sophisticated schedules for the monotonicity loss, possibly reducing λ over the course of training. This would help to learn monotonic behavior in the early training stages while giving the model more flexibility to deviate from such an attention pattern later if needed. In this context, our loss function could also be used as an additional pretraining objective for transfer to very low-resource tasks. We would also like to test our loss function on tasks where the alignment may be harder to learn, for example in multimodal models or for long sequences. Finally, using our loss function as a way to measure monotonicity could be an interesting tool for interpretability research.

A.1 Character-level Mean F-score (MFS)
LCS(c_i, r_i) = (1/2) (|c_i| + |r_i| − ED(c_i, r_i))

where c_i is the i-th candidate and r_i is the corresponding reference transliteration with the smallest edit distance (ED).
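Combined with the standard precision/recall formulation of the NEWS shared task (P_i = LCS/|c_i|, R_i = LCS/|r_i|, F_i = 2 P_i R_i / (P_i + R_i)), the character-level F-score for a single candidate/reference pair can be sketched as follows (function names are our own):

```python
def edit_distance(a, b):
    """Levenshtein distance via standard dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def char_fscore(cand, ref):
    """Character-level F-score based on the LCS approximation above."""
    lcs = 0.5 * (len(cand) + len(ref) - edit_distance(cand, ref))
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

The full MFS averages this F-score over all test items, each scored against its closest reference.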