Surprisingly Easy Hard-Attention for Sequence to Sequence Learning

In this paper we show that a simple beam approximation of the joint distribution between attention and output is an easy, accurate, and efficient attention mechanism for sequence to sequence learning. The method combines the advantage of the sharp focus of hard attention with the implementation ease of soft attention. On five translation tasks we show effortless and consistent gains in BLEU compared to existing attention mechanisms.


Introduction
In structured input-output models, as used in tasks like translation and image captioning, the attention variable decides which part of the input aligns with the current output. Many attention mechanisms have been proposed (Xu et al., 2015; Bahdanau et al., 2014; Luong et al., 2015; Martins and Astudillo, 2016), but the de facto standard is a soft attention mechanism that first assigns attention weights to input encoder states, then computes an attention-weighted 'soft' aligned input state, from which the output distribution is finally derived. This method is end-to-end differentiable and easy to implement.
Another, less popular, variant is hard attention, which aligns each output to exactly one input state but requires intricate training to teach the network to choose that state. When successfully trained, hard attention is often found to be more accurate (Xu et al., 2015; Zaremba and Sutskever, 2015). In NLP, a recent success has been monotonic hard attention for morphological inflection tasks (Yu et al., 2016; Aharoni and Goldberg, 2017). For general seq2seq learning, methods like SparseMax (Martins and Astudillo, 2016) and local attention (Luong et al., 2015) were proposed to bridge the gap between soft and hard attention.

In this paper we propose a surprisingly simple alternative based on the original joint distribution between output and attention, of which existing soft and hard attention mechanisms are approximations. The joint model couples input states individually to the output, as in hard attention, but retains the end-to-end trainability of soft attention. When the number of input states is large, we propose a simple approximation of the full joint distribution called Beam-joint. This approximation is also easily trainable and does not suffer from the high-variance Monte Carlo sampling gradients of hard attention. We evaluated our model on five translation tasks and increased BLEU by 0.8 to 1.7 over soft attention, which in turn was better than hard attention and the recent Sparsemax (Martins and Astudillo, 2016) attention. More importantly, training was as easy as for soft attention. For further support, we also evaluate on two morphological inflection tasks and obtain gains over soft and hard attention.

* Both authors contributed equally to this work.

Background and Related Work
For sequence to sequence (seq2seq) learning, the encoder-decoder model is the standard; we review it here, followed by related work on attention mechanisms for these models.

Attention-based Encoder Decoder Model
Let x_1, ..., x_m denote the tokens in the input sequence that have been transformed by an encoder network to state vectors x_1, ..., x_m, which we jointly denote as x_{1..m}. Let y_1, ..., y_n denote the output tokens in the target sequence. The Encoder-Decoder (ED) network factorizes Pr(y_1, ..., y_n | x_{1..m}) as Π_{t=1}^{n} Pr(y_t | x_{1..m}, s_t), where s_t is a decoder state summarizing y_1, ..., y_{t-1}. For each t, a hidden attention variable a_t denotes which part of x_{1..m} aligns with y_t. Let P(a_t = j | x_{1..m}, s_t) denote the probability that encoder state x_j is relevant for output y_t. Typically this is estimated using a softmax over attention scores computed from x_j and the decoder state s_t:

P(a_t = j | x_{1..m}, s_t) = exp(A_θ(x_j, s_t)) / Σ_{j'=1}^{m} exp(A_θ(x_{j'}, s_t))    (1)
where A_θ(., .) is the attention unit that scores each input state x_j given the decoder state s_t. In the popular soft-attention mechanism, the attention-weighted sum of the input states is then used to model the likelihood of each y_t:

P(y_t | x_{1..m}, s_t) = P(y_t | Σ_{j=1}^{m} P_t(a_t = j) x_j)    (2)

where P_t(a_t = j) is short for P(a_t = j | x_{1..m}, s_t). Here and in the rest of the paper we drop s_t from P(y_t) and P_t(a) for ease of notation. The weighted sum Σ_a P_t(a) x_a is called the input context c_t, which is fed to the decoder RNN along with y_t to compute the next state s_{t+1}.
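As a concrete illustration, Eqs 1-2 can be sketched in a few lines of NumPy. The dot-product scorer and the toy dimensions below are stand-ins for the learned attention unit A_θ and real encoder/decoder states:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a score vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical toy setup: m = 4 encoder states of size d = 3.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))   # encoder states x_1..x_m
s_t = rng.normal(size=3)      # decoder state s_t

# Eq 1: attention distribution; a dot product stands in for A_theta(x_j, s_t).
scores = x @ s_t
p_att = softmax(scores)       # P_t(a_t = j), sums to 1

# Eq 2: soft attention collapses the inputs into one context
# c_t = sum_j P_t(a=j) x_j, which feeds a single output softmax.
c_t = p_att @ x
```

Note how the inputs are averaged *before* the output distribution is computed; this early marginalization is exactly what diffuses the input-output coupling discussed below.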

Related Work
We next review existing attention types.
Soft Attention is the attention method described in the previous section and is the current standard for seq2seq learning (Xu Chen, 2018; Koehn, 2017). It was proposed for translation in (Bahdanau et al., 2014) and refined further in (Luong et al., 2015). As shown in Eq 2, each output is derived from an attention-averaged input, which diffuses the coupling between input and output. The advantage of soft attention is end-to-end differentiability, and fast training and inference.
Hard Attention was proposed in its current form in (Xu et al., 2015) and attends to exactly one input state for each output. During training, the log-likelihood is an expectation over sampled attentions:

log P(y_t | x_{1..m}) ≈ (1/M) Σ_{i=1}^{M} log P(y_t | x_{ã_i})    (3)

where ã_1, ..., ã_M are sampled from the multinomial P_t(a). Because of the sampling, the gradient has to be computed by Monte Carlo gradient estimation/REINFORCE (Williams, 1992) and is subject to high variance. Many tricks are required to train hard attention, and there is little standardization across implementations. Xu et al. (2015) use a combination of REINFORCE and soft attention. Zaremba and Sutskever (2015) use curriculum learning that starts as soft attention and gradually becomes discrete. Ling and Rush (2017) aggregate multiple samples during training and use a single sampled attention at test time. However, once trained well, the sharp focus on memory provided by hard attention has been found to yield superior performance (Xu et al., 2015; Shankar and Sarawagi, 2018).
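The sampled estimate of Eq 3 and the expectation it targets can be sketched as follows. The toy distributions are hypothetical, and only the forward estimate is shown; the REINFORCE gradient through the sampling step is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
p_att = np.array([0.7, 0.2, 0.1])           # P_t(a) over m = 3 input states
log_p_y_given_a = np.log([0.5, 0.1, 0.05])  # log P(y_t | x_a) for each a

# Eq 3: Monte Carlo estimate of E_{a ~ P_t(a)}[log P(y_t | x_a)].
# The gradient of this sampling step is what REINFORCE must estimate,
# which is the source of the high variance discussed above.
M = 10000
samples = rng.choice(3, size=M, p=p_att)
mc_estimate = log_p_y_given_a[samples].mean()

# The exact expectation, computable here because the toy problem is tiny.
exact = (p_att * log_p_y_given_a).sum()
```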
Sparse/Local Attention Many attempts have been made to bridge the gap between soft and hard attention. Luong et al. (2015) propose local attention that averages over a window of the input. This has later been refined to include syntax (Chen et al., 2017; Sennrich and Haddow, 2016). Another idea is to replace the softmax in soft attention with sparsity-inducing operators (Martins and Astudillo, 2016; Niculae and Blondel, 2017). However, all sparse/local attention methods continue to compute P(y) from an attention-weighted sum of inputs (Eq 2), unlike hard attention.

Joint Attention-Output Models
We start from an explicit joint representation of the uncertainty of the attention and output variables:

P(y_t | x_{1..m}) = Σ_{a=1}^{m} P_t(a) P(y_t | x_a)    (4)
The joint model directly couples individual input states to the output, and is thus a type of hard attention. At the same time, by taking an expectation instead of a single hard selection, it enjoys the differentiability of soft attention. We call this the full-joint method. Unfortunately, when either the vocabulary or the number of encoder states (m) is large, full-joint is not practical. Existing hard and soft attentions can be viewed as approximations of it that either marginalize early or hard-select attention. We propose a surprisingly simple alternative approximation that provides hard attention without its training complexity. Our method, called Beam-joint, deterministically selects the top-K highest attention values and approximates the full joint log-probability as

log P_t(y_t | x_{1..m}) ≈ log Σ_{a ∈ TopK(P_t(a))} P_t(a) P_t(y_t | x_a)    (5)

Thus, in beam-joint, we first compute the multinomial attention distribution in O(m) time using Eq 1, select the top-K input positions, then with hard attention on each position compute K output softmaxes, and finally compute the attention-weighted output mixture distribution. The number of output softmax computations is K times that of normal soft attention, but the actual running-time overhead is only 20-30% for translation tasks. We used the default pass-through TopK operator (which is not differentiable) and optimize the beam approximation directly. We also experimented with a version that smoothly shifts from soft attention to beam attention, but found that training the beam approximation directly gives the best results.
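A minimal NumPy sketch of the beam-joint forward computation in Eq 5, with random logits standing in for the learned output layer and attention unit:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical toy setup: m = 8 input states, vocab size V = 5, beam K = 3.
rng = np.random.default_rng(1)
m, V, K = 8, 5, 3
p_att = softmax(rng.normal(size=m))   # P_t(a) from Eq 1
W = rng.normal(size=(m, V))           # stand-in per-position output logits

# Eq 5: deterministically keep the top-K attention positions (pass-through
# TopK, not differentiable), run one output softmax per kept position, and
# mix the K softmaxes by their attention weights.
topk = np.argsort(p_att)[-K:]
out_mix = p_att[topk] @ softmax(W[topk])   # sum_{a in TopK} P_t(a) P_t(y | x_a)
log_p = np.log(out_mix)                    # the beam-joint log-probability
```

Because the kept attention weights sum to less than one, the mixture is an unnormalized lower bound on the full joint of Eq 4; with K = m the two coincide.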
We show empirically that this very simple scheme is surprisingly effective compared to existing hard and soft attention on several translation tasks. Unlike sampling and variational methods that require careful tuning and exotic tricks during training, this scheme trains as easily as soft attention, without a significant increase in training time, because even K = 6 works well.
Another reason why our 'sum of probabilities' form performs better could be the softmax barrier effect highlighted in (Yang et al., 2018). The authors argue that the richness of natural language cannot be captured by a single softmax because of the low-rank constraint it imposes on the input-to-output matrix, and they improve performance with a Mixture of Softmaxes model. Our beam-joint is also a mixture of softmaxes and can thus achieve higher rank than a single softmax. However, their mixture requires learning multiple softmax matrices, whereas our mixture components arise from varying attention positions and we learn no extra parameters beyond those of soft attention.
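The rank argument can be illustrated numerically. In the hypothetical setup below, a single softmax over rank-d logits yields a log-probability matrix of rank at most d+1, while even a two-component mixture (mimicking two attention positions sharing one output matrix) escapes that bound:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, V, d = 20, 15, 2            # contexts, vocab size, hidden size

H = rng.normal(size=(n, d))    # context vectors
W = rng.normal(size=(d, V))    # single shared output matrix

# Single softmax: log P = HW - log Z(H) 1^T, so rank <= d + 1.
single = np.log(softmax(H @ W))

# Two-component mixture with a second set of contexts: the log of a sum of
# softmaxes is no longer low-rank in general.
H2 = rng.normal(size=(n, d))
mix = np.log(0.5 * softmax(H @ W) + 0.5 * softmax(H2 @ W))

r_single = np.linalg.matrix_rank(single)
r_mix = np.linalg.matrix_rank(mix)
```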

Experiments
We compare attention models on two NLP tasks: machine translation and morphological inflection.

Machine translation
We experiment on five language pairs from three datasets: IWSLT15 English↔Vietnamese (Cettolo et al., 2015), with 133k train, 1.5k validation (tst2012), and 1.2k test (tst2013) sentence pairs; IWSLT14 German↔English (Cettolo et al., 2014), with 160k train, 7.2k validation, and 6.7k test sentence pairs; and Workshop on Asian Translation 2017 Japanese→English (Nakazawa et al., 2016), with 2M train, 1.8k validation, and 1.8k test sentence pairs. We use a 2-layer bi-directional encoder and a 2-layer unidirectional decoder with 512 hidden LSTM units and a 0.2 dropout rate, trained with the vanilla SGD optimizer. We base our implementation on the NMT code in TensorFlow. We did no special hyper-parameter tuning and used standard-softmax-tuned parameters with a batch size of 64.
Comparing attention models We compare beam-joint (default K = 6) with standard soft and hard attention. To further dissect the reasons behind beam-joint's gains, we compare beam-joint with a sampling-based approximation of full-joint, called Sample-joint, that replaces the TopK in Eq 5 with K attention-weighted samples. We train sample-joint, as well as hard attention, with REINFORCE using 6 samples. To ascertain that our gains are not explained by sparsity alone, we also compare with Sparsemax (Martins and Astudillo, 2016).
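The sample-joint forward computation can be sketched under the same toy conventions as before (random stand-in logits, hypothetical dimensions); note that sampled positions can repeat, unlike TopK's distinct states:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
m, V, K = 8, 5, 6
p_att = softmax(rng.normal(size=m))   # P_t(a)
W = rng.normal(size=(m, V))           # stand-in per-position output logits

# Sample-joint: draw K positions from P_t(a) (with possible repeats) and
# average their output softmaxes; an unbiased estimate of Eq 4's mixture.
samples = rng.choice(m, size=K, p=p_att)
out = softmax(W[samples]).mean(axis=0)
```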
In Table 1 we show perplexity and BLEU with three beam sizes (B). Beam-joint significantly outperforms all other variants, including standard soft attention, by 0.8 to 1.7 BLEU points. Perplexity shows an even more impressive drop on all five datasets. We also observe training times for beam-joint to be only 20-30% higher than soft attention, establishing that beam-joint is both practical and more accurate.
Sample-joint is much worse than beam-joint. Apart from the high variance of gradients in the REINFORCE step, another problem is that sampling repeats states, whereas TopK in beam-joint selects distinct states. Hard attention also faces training issues and performs worse than soft attention, explaining why it is not commonly used in NMT. Sample-joint is better than hard attention, further highlighting the merits of the joint distribution. Sparsemax is competitive but marginally worse than soft attention, consistent with the recent experiments of (Niculae and Blondel, 2017).
Comparison with Full-joint Next we evaluate the impact of our beam-joint approximation against full-joint and soft attention. Full-joint cannot scale to large vocabularies, so we compare only on En-Vi with a batch size of 32. Figure 1a shows the final BLEU of these methods, as well as BLEU against increasing training steps. Beam-joint both converges faster and reaches a higher score than soft attention, while closely matching full-joint. This shows that an attention beam of size 6 suffices to approximate the full joint almost perfectly.
Next, in Figure 1b, we compare beam-joint (solid lines) and soft attention (dotted lines) for convergence rates on three other datasets. For each dataset beam-joint trains faster with a consistent improvement of more than 1 BLEU.

Effect of K in Beam-joint
We show the effect of K used in the TopK of beam-joint in Figure 2 on the En-Vi and De-En tasks. On En-Vi, BLEU increases from 16.0 to 25.7 to 26.5 as K increases from 1 to 2 to 3, and then quickly saturates. Similar behavior is observed on the other dataset. This shows that small K values like 6 suffice for translation.
In Table 2 we further evaluate whether the performance gain of beam-joint is due to the softmax barrier alone. We take our models trained with K = 6 and deploy them for test-time greedy decoding with K set to 1. Since the output now has only a single softmax component, this model faces the same bottleneck as soft attention. As expected, these results are worse than beam-joint with K = 6; however, they still exceed soft attention by a significant margin, demonstrating that the performance gain is not solely due to the ensembling or softmax-barrier effect.

Morphological Inflection
To demonstrate the use of this approach beyond translation, we next consider two morphological inflection tasks. We use the dataset of (Durrett and DeNero, 2013), containing 8 inflection forms for German nouns (de-N) and 27 forms for German verbs (de-V). The number of training words is 2364 and 1627 respectively, while the validation and test sets contain 200 words each. We train a one-layer encoder and decoder with 128 hidden LSTM units each and a dropout rate of 0.2 using Adam (Kingma and Ba, 2014), and measure 0/1 accuracy for the soft, hard, and full-joint attention models. Due to the limited input length and vocabulary, we were able to run the full-joint model directly. We also ran the 100-unit-wide two-layer LSTM with hard monotonic attention provided by (Aharoni and Goldberg, 2017), labeled Hard-Mono. The table below shows that even for this task full-joint scores over existing attention models. The generic full-joint attention provides slight gains even over the task-specific hard monotonic attention.

Conclusion
In this paper we showed a simple yet effective approximation of the joint attention-output distribution in sequence to sequence learning. Our joint model consistently provides higher accuracy, without significant running-time overheads, on five translation and two morphological inflection tasks. An interesting direction for future work is to extend beam-joint to multi-head attention architectures as in (Vaswani et al., 2017; Xu Chen, 2018).