Sparse and Constrained Attention for Neural Machine Translation

In neural machine translation, words are sometimes dropped from the source or generated repeatedly in the translation. We explore novel strategies to address the coverage problem that change only the attention transformation. Our approach allocates fertilities to source words, which are used to bound the attention each word can receive. We experiment with various sparse and constrained attention transformations and propose a new one, constrained sparsemax, shown to be differentiable and sparse. Empirical evaluation is provided for three language pairs.


Introduction
Neural machine translation (NMT) emerged in the last few years as a very successful paradigm (Sutskever et al., 2014; Bahdanau et al., 2014; Gehring et al., 2017; Vaswani et al., 2017). While NMT is generally more fluent than previous statistical systems, adequacy is still a major concern (Koehn and Knowles, 2017): common mistakes include dropping source words and repeating words in the generated translation.
Previous work has attempted to mitigate this problem in various ways. Wu et al. (2016) incorporate coverage and length penalties during beam search, a simple yet limited solution, since it only affects the scores of translation hypotheses that are already in the beam. Other approaches involve architectural changes: providing coverage vectors to track the attention history (Mi et al., 2016; Tu et al., 2016), using gating architectures and adaptive attention to control the amount of source context provided (Tu et al., 2017a; Li and Zhu, 2017), or adding a reconstruction loss (Tu et al., 2017b). Feng et al. (2016) also use the notion of fertility implicitly in their proposed model. Their fertility-conditioned decoder uses a coverage vector and an extract gate, which are incorporated in the decoding recurrent unit, increasing the number of parameters.
* Work done during an internship at Unbabel.
In this paper, we propose a different solution that does not change the overall architecture, but only the attention transformation. Namely, we replace the traditional softmax by other recently proposed transformations that either promote attention sparsity (Martins and Astudillo, 2016) or upper bound the amount of attention a word can receive (Martins and Kreutzer, 2017). The bounds are determined by the fertility values of the source words. While these transformations have given encouraging results in various NLP problems, they have never been applied to NMT, to the best of our knowledge. Furthermore, we combine these two ideas and propose a novel attention transformation, constrained sparsemax, which produces both sparse and bounded attention weights, yielding a compact and interpretable set of alignments. While being in between soft and hard alignments (Figure 2), the constrained sparsemax transformation is end-to-end differentiable, hence amenable to training with gradient backpropagation.
To sum up, our contributions are as follows:
• We formulate constrained sparsemax and derive efficient linear and sublinear-time algorithms for running forward and backward propagation. This transformation has two levels of sparsity: over time steps, and over the attended words at each step.
• We provide a detailed empirical comparison of various attention transformations, including softmax (Bahdanau et al., 2014), sparsemax (Martins and Astudillo, 2016), constrained softmax (Martins and Kreutzer, 2017), and our newly proposed constrained sparsemax. We provide error analysis including two new metrics targeted at detecting coverage problems.

Preliminaries
Our underlying model architecture is a standard attentional encoder-decoder (Bahdanau et al., 2014). Let $x := x_{1:J}$ and $y := y_{1:T}$ denote the source and target sentences, respectively. We use a Bi-LSTM encoder to represent the source words as a matrix $H := [h_1, \ldots, h_J] \in \mathbb{R}^{2D \times J}$. The conditional probability of the target sentence is given as
$$p(y \mid x) := \prod_{t=1}^{T} p(y_t \mid y_{1:(t-1)}, x), \tag{1}$$
where $p(y_t \mid y_{1:(t-1)}, x)$ is computed by a softmax output layer that receives a decoder state $s_t$ as input. This state is updated by an auto-regressive LSTM, $s_t = \mathrm{RNN}(\mathrm{embed}(y_{t-1}), s_{t-1}, c_t)$, where $c_t$ is an input context vector. This vector is computed as $c_t := H\alpha_t$, where $\alpha_t$ is a probability distribution that represents the attention over the source words, commonly obtained as
$$\alpha_t = \mathrm{softmax}(z_t), \tag{2}$$
where $z_t \in \mathbb{R}^J$ is a vector of scores. We follow Luong et al. (2015) and define $z_{t,j} := s_{t-1}^\top W h_j$ as a bilinear transformation of encoder and decoder states, where $W$ is a model parameter.
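As an illustration of the attention step just described, the following minimal sketch computes bilinear scores, softmax attention weights, and the context vector for one decoding step. It uses plain Python lists instead of tensors, and the function names (`softmax`, `bilinear_scores`, `context_vector`) and the toy values are assumptions made for the example, not part of the original implementation.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def bilinear_scores(s_prev, W, H):
    """z_j = s_prev^T W h_j for every encoder state h_j (H is a list of J vectors)."""
    # q = W^T s_prev, so that each score is a single dot product q . h_j
    q = [sum(s_prev[i] * W[i][k] for i in range(len(s_prev)))
         for k in range(len(W[0]))]
    return [sum(qk * hk for qk, hk in zip(q, h)) for h in H]

def context_vector(alpha, H):
    """c = H alpha: attention-weighted sum of the encoder states."""
    dim = len(H[0])
    return [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(dim)]

# Toy example: J = 3 source words with 2-dimensional states (hypothetical values).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
s_prev = [2.0, 0.0]

z = bilinear_scores(s_prev, W, H)
alpha = softmax(z)
c = context_vector(alpha, H)
```

Note that every entry of `alpha` is strictly positive here, which is the property of softmax that the sparse alternatives are meant to relax.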

Sparse and Constrained Attention
In this work, we consider alternatives to Eq. 2. Since the softmax is strictly positive, it forces all words in the source to receive some probability mass in the resulting attention distribution, which can be wasteful. Moreover, it may happen that the decoder attends repeatedly to the same source words across time steps, causing repetitions in the generated translation, as Tu et al. (2016) observed. With this in mind, we replace Eq. 2 by $\alpha_t = \rho(z_t, u_t)$, where $\rho$ is a transformation that may depend both on the scores $z_t \in \mathbb{R}^J$ and on upper bounds $u_t \in \mathbb{R}^J$ that limit the amount of attention that each word can receive. We consider three alternatives to softmax, described next.
Sparsemax. The sparsemax transformation (Martins and Astudillo, 2016) is defined as:
$$\mathrm{sparsemax}(z) := \arg\min_{\alpha \in \Delta^J} \|\alpha - z\|^2, \tag{3}$$
where $\Delta^J := \{\alpha \in \mathbb{R}^J \mid \alpha \ge 0, \sum_j \alpha_j = 1\}$ is the probability simplex. In words, it is the Euclidean projection of the scores $z$ onto the probability simplex. These projections tend to hit the boundary of the simplex, yielding a sparse probability distribution. This allows the decoder to attend only to a few words in the source, assigning zero probability mass to all other words. Martins and Astudillo (2016) have shown that the sparsemax can be evaluated in $O(J)$ time (same asymptotic cost as softmax) and gradient backpropagation takes sublinear time (faster than softmax), by exploiting the sparsity of the solution.
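To make the projection concrete, here is a minimal sketch of the sparsemax forward pass. It uses the simple sort-based procedure, which costs O(J log J) rather than the O(J) cited above, so it should be read as an illustration and not as the authors' implementation. The function name and test values are assumptions.

```python
def sparsemax(z):
    """Euclidean projection of the score vector z onto the probability simplex."""
    z_sorted = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, v in enumerate(z_sorted, start=1):
        cumsum += v
        # The k-th largest score is in the support while 1 + k * z_(k) > cumsum.
        if 1.0 + k * v > cumsum:
            tau = (cumsum - 1.0) / k
        else:
            break
    # Scores below the threshold tau receive exactly zero probability.
    return [max(0.0, v - tau) for v in z]

peaked = sparsemax([2.0, 1.0, 0.1])   # one score dominates
spread = sparsemax([1.0, 0.8, 0.1])   # two scores are close
```

In the first call the whole mass goes to the top-scoring word; in the second it is split between the two close scores while the third word gets exactly zero, which softmax would never produce.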
Constrained softmax. The constrained softmax transformation was recently proposed by Martins and Kreutzer (2017) in the context of easy-first sequence tagging, being defined as follows:
$$\mathrm{csoftmax}(z; u) := \arg\min_{\alpha \in \Delta^J} \mathrm{KL}(\alpha \,\|\, \mathrm{softmax}(z)) \quad \text{s.t.} \quad \alpha \le u, \tag{4}$$
where $\Delta^J$ is the probability simplex, $u$ is a vector of upper bounds, and $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the Kullback-Leibler divergence. In other words, it returns the distribution closest to $\mathrm{softmax}(z)$ whose attention probabilities are bounded by $u$. Martins and Kreutzer (2017) have shown that this transformation can be evaluated in $O(J \log J)$ time and its gradients backpropagated in $O(J)$ time.
To use this transformation in the attention mechanism, we make use of the idea of fertility (Brown et al., 1993). Namely, let $\beta_{t-1} := \sum_{\tau=1}^{t-1} \alpha_\tau$ denote the cumulative attention that each source word has received up to time step $t$, and let $f := (f_j)_{j=1}^{J}$ be a vector containing fertility upper bounds for each source word. The attention at step $t$ is computed as
$$\alpha_t = \mathrm{csoftmax}(z_t; f - \beta_{t-1}). \tag{5}$$
Intuitively, each source word $j$ gets a credit of $f_j$ units of attention, which are consumed along the decoding process. If all the credit is exhausted, it receives zero attention from then on. Unlike the sparsemax transformation, which places sparse attention over the source words, the constrained softmax leads to sparsity over time steps.
[Figure caption, beginning missing: "... respectively. For constrained softmax/sparsemax, we set unit fertilities to every word; for each row the upper bounds (represented as green dashed lines) are set as the difference between these fertilities and the cumulative attention each word has received. The last row illustrates the cumulative attention for the three words after all rounds."]
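The credit mechanism can be sketched as follows. The `csoftmax` function below uses a simple iterative clipping scheme (clip every word whose rescaled softmax value exceeds its bound, then renormalize the rest), which is an assumption chosen for readability and not necessarily the O(J log J) algorithm of Martins and Kreutzer (2017). The helper `attend_with_fertility` and the toy scores are likewise hypothetical.

```python
import math

def csoftmax(z, u):
    """Constrained softmax: alpha_j = min(u_j, exp(z_j) / Z), summing to one.

    Assumes the problem is feasible, i.e. sum(u) >= 1.
    """
    m = max(z)
    e = [math.exp(v - m) for v in z]
    alpha = [0.0] * len(z)
    free = set(range(len(z)))   # indices not yet clipped at their upper bound
    mass = 1.0                  # probability mass still available to free indices
    while free:
        total = sum(e[j] for j in free)
        violated = [j for j in free if mass * e[j] / total > u[j]]
        if not violated:
            for j in free:
                alpha[j] = mass * e[j] / total
            break
        for j in violated:      # clip at the bound and remove from the free set
            alpha[j] = u[j]
            mass = max(0.0, mass - u[j])
            free.remove(j)
    return alpha

def attend_with_fertility(score_rows, fertility):
    """Run Eq. 5 over time steps, consuming each word's attention credit."""
    beta = [0.0] * len(fertility)          # cumulative attention per source word
    rows = []
    for z in score_rows:
        u = [max(0.0, f - b) for f, b in zip(fertility, beta)]
        alpha = csoftmax(z, u)
        beta = [b + a for b, a in zip(beta, alpha)]
        rows.append(alpha)
    return rows, beta

# Three source words with unit fertility; the scores always favor word 0.
rows, beta = attend_with_fertility([[2.0, 1.0, 0.0]] * 3, [1.0, 1.0, 1.0])
```

By the third step word 0 has used up its credit and receives (numerically) zero attention, even though its score is still the highest, which illustrates the sparsity over time steps mentioned above.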
Constrained sparsemax. In this work, we propose a novel transformation which shares the two properties above: it provides both sparse and bounded probabilities. It is defined as:
$$\mathrm{csparsemax}(z; u) := \arg\min_{\alpha \in \Delta^J} \|\alpha - z\|^2 \quad \text{s.t.} \quad \alpha \le u, \tag{6}$$
where $\Delta^J$ is the probability simplex. The following result, whose detailed proof we include as supplementary material (Appendix A), is key for enabling the use of the constrained sparsemax transformation in neural networks.
Proposition 1 Let $\alpha = \mathrm{csparsemax}(z; u)$ be the solution of Eq. 6, and define the sets [set definitions missing]. Then:
• Forward propagation. $\alpha$ can be computed in $O(J)$ time with the algorithm of Pardalos and Kovoor (1990) (Alg. 1 in Appendix A). The solution takes the form $\alpha_j = \max\{0, \min\{u_j, z_j - \tau\}\}$, where $\tau$ is a normalization constant.

Fertility Bounds
We experiment with three ways of setting the fertility of the source words: CONSTANT, GUIDED, and PREDICTED. With CONSTANT, we set the fertilities of all source words to a fixed integer value $f$. With GUIDED, we train a word aligner based on IBM Model 2 (we used fast_align (Dyer et al., 2013) in our experiments) and, for each word in the vocabulary, we set the fertilities to the maximal observed value in the training data (or 1 if no alignment was observed). With the PREDICTED strategy, we train a separate fertility predictor model using a bi-LSTM tagger. At training time, we provide as supervision the fertility estimated by fast_align. Since our model works with fertility upper bounds and the word aligner may miss some word pairs, we found it beneficial to add a constant to this number (1 in our experiments). At test time, we use the expected fertilities according to our model.
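A minimal sketch of the GUIDED strategy is given below, assuming alignments in the Pharaoh "i-j" format that fast_align emits (source index first). The function names, the toy sentences, and the way unseen words fall back to 1 are illustrative assumptions, not the authors' code.

```python
from collections import Counter, defaultdict

def guided_fertilities(sources, alignments):
    """Maximal observed fertility for each source word type.

    sources:    list of tokenized source sentences (lists of strings).
    alignments: one Pharaoh-format line per sentence, e.g. "0-0 1-1 1-2".
    """
    max_fert = defaultdict(int)
    for tokens, line in zip(sources, alignments):
        # Number of target words aligned to each source position.
        counts = Counter(int(pair.split("-")[0]) for pair in line.split())
        for i, word in enumerate(tokens):
            max_fert[word] = max(max_fert[word], counts[i])
    return max_fert

def fertility_of(word, table):
    """Fertility upper bound for a word; 1 if no alignment was observed."""
    return max(1, table.get(word, 0))

sources = [["das", "haus"], ["das", "ist", "gut"]]
alignments = ["0-0 1-1 1-2", "0-0 2-1 2-2 2-3"]
table = guided_fertilities(sources, alignments)
```

Here "gut" is aligned to three target words in the second sentence, so its bound is 3, while "ist" is never aligned and falls back to 1.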

Sink token. We append an additional <SINK> token to the end of the source sentence, to which we assign unbounded fertility ($f_{J+1} = \infty$). The token is akin to the null alignment in IBM models. The reason we add this token is the following: without the sink token, the length of the generated target sentence can never exceed $\sum_j f_j$ words if we use constrained softmax/sparsemax. At training time this may be problematic, since the target length is fixed and the problems in Eqs. 4-6 can become infeasible. By adding the sink token we guarantee $\sum_j f_j = \infty$, eliminating the problem.
Exhaustion strategies. To avoid missing source words, we implemented a simple strategy to encourage more attention to words with larger credit: we redefine the pre-attention word scores as $z'_t = z_t + c\,u_t$, where $c$ is a constant ($c = 0.2$ in our experiments). This increases the score of words which have not yet exhausted their fertility (we may regard it as a "soft" lower bound in Eqs. 4-6).
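The two devices above can be sketched together: upper bounds computed from fertilities with a sink token of infinite fertility, and scores shifted by $c\,u_t$. How the bonus is applied to the sink token is not specified in the text; the sketch assumes, purely for illustration, that the score is left unchanged wherever the bound is infinite. The function names and values are hypothetical.

```python
import math

def upper_bounds(fertility, beta):
    """u_j = f_j - beta_j, floored at zero; an infinite fertility stays infinite."""
    return [max(0.0, f - b) for f, b in zip(fertility, beta)]

def exhaustion_scores(z, u, c=0.2):
    """z'_j = z_j + c * u_j for words with a finite bound.

    Assumption: the sink token (infinite bound) keeps its original score,
    since adding c * inf would make its score infinite.
    """
    return [zj + c * uj if math.isfinite(uj) else zj for zj, uj in zip(z, u)]

# Two source words plus a <SINK> token with unbounded fertility.
fertility = [1.0, 2.0, math.inf]
beta = [1.0, 0.5, 0.3]            # cumulative attention received so far
u = upper_bounds(fertility, beta)
z_adjusted = exhaustion_scores([0.5, 0.5, 0.0], u)
```

The first word has exhausted its credit and gets no bonus, whereas the second word, with 1.5 units of credit left, has its score raised.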

Experiments
We evaluated our attention transformations on three language pairs. We focused on small datasets: this regime is the most affected by coverage mistakes and demands more structural biases, which makes it a good test bed for our methods. We use the IWSLT 2014 corpus for DE-EN, the KFTT corpus for JA-EN (Neubig, 2011), and the WMT 2016 dataset for RO-EN. The training sets have 153,326, 329,882, and 560,767 parallel sentences, respectively. We tokenized the data using the Moses scripts and preprocessed it with subword units (Sennrich et al., 2016) with a joint vocabulary and 32k merge operations. Our implementation was done on a fork of the OpenNMT-py toolkit (Klein et al., 2017) with the default parameters (footnote 4). We used a validation set to tune hyperparameters introduced by our model. Even though our attention implementations are CPU-based using NumPy (unlike the rest of the computation which is done on the GPU), we did not observe any noticeable slowdown using multiple devices.
As baselines, we use softmax attention, as well as two recently proposed coverage models:
• COVPENALTY (Wu et al., 2016, §7). At test time, the hypotheses in the beam are rescored with a global score that includes a length and a coverage penalty (footnote 5). We tuned $\alpha$ and $\beta$ with grid search on $\{0.2k\}_{k=0}^{5}$, as in Wu et al. (2016).
• COVVECTOR (Tu et al., 2016). At training and test time, coverage vectors $\beta$ and additional parameters $v$ are used to condition the next attention step. We adapted this to our bilinear attention by defining $z_{t,j} = s_{t-1}^\top (W h_j + v \beta_{t-1,j})$.
We also experimented with combining the strategies above with the sparsemax transformation.
As evaluation metrics, we report tokenized BLEU, METEOR (Denkowski and Lavie, 2014), as well as two new metrics that we describe next to account for over- and under-translation (footnote 6).
4 We used a 2-layer LSTM, embedding and hidden size of 500, dropout 0.3, and the SGD optimizer for 13 epochs.
5 Since our sparse attention can become 0 for some words, we extended the original coverage penalty by adding another parameter $\epsilon$, set to 0.1: $\mathrm{cp}(x; y) := \beta \sum_{j=1}^{J} \log \max\{\epsilon, \min\{1, \sum_{t=1}^{|y|} \alpha_{jt}\}\}$.
6 Both evaluation metrics are included in our software package at www.github.com/Unbabel/sparse_constrained_attention.
REP-score: a new metric to count repetitions. Formally, given an n-gram $s \in V^n$, let $t(s)$ and $r(s)$ be its frequency in the model translation and in the reference, respectively. We first compute a sentence-level score $\sigma(t, r)$ [equation missing]. The REP-score is then given by summing $\sigma(t, r)$ over sentences, normalizing by the number of words in the reference corpus, and multiplying by 100. We used $n = 2$, $\lambda_1 = 1$ and $\lambda_2 = 2$.
DROP-score: a new metric that accounts for possibly dropped words. To compute it, we first compute two sets of word alignments: from source to reference translation, and from source to the predicted translation. In our experiments, the alignments were obtained with fast_align (Dyer et al., 2013), trained on the training partition of the data. Then, the DROP-score computes the percentage of source words that aligned with some word from the reference translation, but not with any word from the predicted translation.
Table 1 shows the results. We can see that on average, the sparse models (csparsemax as well as sparsemax combined with coverage models) have higher scores on both BLEU and METEOR. Generally, they also obtain better REP and DROP scores than csoftmax and softmax, which suggests that sparse attention alleviates the problem of coverage to some extent.
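The DROP-score described earlier in this section can be sketched as below, given alignment lines in the Pharaoh "i-j" format produced by fast_align (source index first). The denominator is taken to be the total number of source words, which is one reading of "percentage of source words" and should be treated as an assumption; the function names and toy data are hypothetical, and the released package may differ.

```python
def aligned_source_positions(line):
    """Set of source positions that are aligned to at least one target word."""
    return {int(pair.split("-")[0]) for pair in line.split()}

def drop_score(source_lengths, ref_alignments, hyp_alignments):
    """Percentage of source words aligned in the reference but not in the prediction.

    Assumption: normalization is over all source words in the corpus.
    """
    dropped = 0
    total = 0
    for n, ref_line, hyp_line in zip(source_lengths, ref_alignments, hyp_alignments):
        ref_set = aligned_source_positions(ref_line)
        hyp_set = aligned_source_positions(hyp_line)
        dropped += len(ref_set - hyp_set)
        total += n
    return 100.0 * dropped / total

# Two sentences: the first prediction leaves source positions 2 and 3 uncovered.
score = drop_score(
    [4, 3],
    ["0-0 1-1 2-2 3-3", "0-0 1-1"],
    ["0-0 1-1", "0-0 1-1 2-2"],
)
```

In this toy case two of the seven source words are counted as dropped, giving a score of about 28.6.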
To compare different fertility strategies, we ran experiments on DE-EN for the csparsemax transformation (Table 2). We see that the PREDICTED strategy outperforms the others both in terms of BLEU and METEOR, albeit slightly.
Figure 2 shows examples of sentences for which the csparsemax fixed repetitions, along with the corresponding attention maps. We see that in the case of softmax repetitions, the decoder attends repeatedly to the same portion of the source sentence (the expression "letzten hundert" in the first sentence and "regierung" in the second sentence). Not only did csparsemax avoid repetitions, but it also yielded a sparse set of alignments, as expected. Appendix B provides more examples of translations from all models in discussion.

Conclusions
We proposed a new approach to address the coverage problem in NMT, by replacing the softmax attentional transformation with sparse and constrained alternatives: sparsemax, constrained softmax, and the newly proposed constrained sparsemax. For the latter, we derived efficient forward and backward propagation algorithms. By incorporating a model for fertility prediction, our attention transformations led to sparse alignments, avoiding repeated words in the translation.