Look Harder: A Neural Machine Translation Model with Hard Attention

Soft-attention based Neural Machine Translation (NMT) models have achieved promising results on several translation tasks. These models attend all the words in the source sequence for each target token, which makes them ineffective for long sequence translation. In this work, we propose a hard-attention based NMT model which selects a subset of source tokens for each target token to effectively handle long sequence translation. Due to the discrete nature of the hard-attention mechanism, we design a reinforcement learning algorithm coupled with reward shaping strategy to efficiently train it. Experimental results show that the proposed model performs better on long sequences and thereby achieves significant BLEU score improvement on English-German (EN-DE) and English-French (ENFR) translation tasks compared to the soft attention based NMT.


Introduction
In recent years, soft-attention based neural machine translation models (Bahdanau et al., 2015;Gehring et al., 2017;Hassan et al., 2018) have achieved state-of-the-art results on different machine translation tasks.The soft-attention mechanism computes the context (encoder-decoder attention) vector for each target token by weighting and combining all the tokens of the source sequence, which makes them ineffective for long sequence translation (Lawson et al., 2017).Moreover, weighting and combining all the tokens of the source sequence may not be required -a few relevant tokens are sufficient for each target token.
Different attention mechanisms have been proposed to improve the quality of the context vector.For example, Luong et al. (2015); Yang et al. (2018) proposed a local-attention mechanism to selectively focus on a small window of source tokens to compute the context vector.Even though local-attention has improved the translation quality, it is not flexible enough to focus on relevant tokens when they fall outside the specified window size.
To overcome the shortcomings of the above approaches, we propose a hard-attention mechanism for a deep NMT model (Vaswani et al., 2017).The proposed model solely selects a few relevant tokens across the entire source sequence for each target token to effectively handle long sequence translation.Due to the discrete nature of the hard-attention mechanism, we design a Reinforcement Learning (RL) algorithm with reward shaping strategy (Ng et al., 1999) to train it.The proposed hard-attention based NMT model consistently outperforms the soft-attention based NMT model (Vaswani et al., 2017), and the gap grows as the sequence length increases.

Background
A typical NMT model based on encoder-decoder architecture generates a target sequence y = {y 1 , • • • , y n } given a source sequence x = {x 1 , • • • , x m } by modeling the conditional probability p(y|x, θ).The encoder (θ e ) computes a set of representations Z = {z 1 , • • • , z m } ∈ R m×d corresponding to x and the decoder (θ d ) generates one target word at a time using the context vector computed using Z.It is trained on a set of D parallel sequences to maximize the log likelihood: where θ = {θ e , θ d }.
In recent years, among all the encoder-decoder architectures for NMT, Transformer (Vaswani et al., 2017) has achieved the best translation quality (Wu et al., 2018).The encoder and decoder blocks of the Transformer are composed of a stack of N (=6) identical layers as shown in Figure 1a.Each layer in the encoder contains two sublayers, a multi-head self-attention mechanism and a position-wise fully connected feed-forward network.Each decoder layer consists of three sublayers; the first and third sub-layers are similar to the encoder sub-layers, and the additional second sub-layer is used to compute the encoderdecoder attention (context) vector based on the soft-attention based approaches (Bahdanau et al., 2015;Gehring et al., 2017).Here we briefly describe the soft computation of encoder-decoder attention vector in the Transformer architecture.Please refer Vaswani et al. (2017) for the detailed architecture.
For each target word ŷt , the second sub-layer in the decoder computes encoder-decoder attention a t based on the encoder representations, Z.In practice we compute the attention vectors simultaneously for all the time steps by packing ŷt 's and z i 's in to matrices.The soft attention of the encoder-decoder, A i , for all the decoding steps is computed as follows: (2) where d is the dimension and Ŷ i−1 ∈ R n×d is the decoder output from the previous layer.

Proposed Model
Section 3.1 introduces our proposed hard-attention mechanism to compute the context vector for each target token.We train the proposed model by designing a RL algorithm with reward shaping strategy -described in Section 3.2.

Hard Attention
Instead of computing the weighted average over all the encoder output as shown in Eq. 2, we specifically select a subset of encoder outputs (z i 's) for the last layer (N ) of the decoder using the hard-attention mechanism as shown in Figure 1a.This allows us to efficiently compute the encoder-decoder attention vector for long sequences.To compute the hard-attention between the last layers of the Transformer encoder-decoder blocks, we replace the second sub-layer of the decoder block's last layer with the RL agent based attention mechanism.Overview of the proposed RL agent based attention mechanism is shown in Figure 1b and computed as follows: First, we learn the projections Ỹ N −1 , Z for Ŷ N −1 and Z as, We then compute the attention scores S as, We apply the hard-attention mechanism on attention scores (S) to dynamically choose multiple relevant encoder tokens for each decoding token.Given S, this mechanism generates an equal length of binary random-variables, β = {β 1 , • • • , β m } for each target token, where β i = 1 indicates that z i is relevant whereas β i = 0 indicates that z i is irrelevant.The relevant tokens are sampled using bernoulli distribution over each β i for all the target tokens.This hard selection of encoder outputs introduces discrete latent variables and estimating them requires RL algorithms.Hence, we design the following reinforcement learner policy for the hard-attention for each decoding step t.
where β t i ∈ β represents the probability of a encoder output (agent's action) being selected at time t, and s t ∈ S is the state of the environment.Now, the hard encoder-decoder attention, Ã, is calculated as, follows: Unlike the soft encoder-decoder attention A in Eq. 2, which contains the weighted average of entire encoder outputs, the hard encoder-decoder attention Ã in Eq. 6 contains information from only relevant encoder outputs for each decoding step.

Strategies for RL training
The model parameters come from the encoder, decoder blocks and reinforcement learning agent, which are denoted as θ e , θ d and θ h respectively.Estimation of θ e and θ d is done by using the objective J 1 in Eq. 1 and gradient descent algorithm.However, estimating θ h is difficult given their discrete nature.Therefore, we formulate the estimation of θ h as a reinforcement learning problem and design a reward function over it.An overveiw of the proposed RL training is given in Algorithm 1.We use BLEU (Papineni et al., 2002) score between the predicted sequence and the target sequence as our reward function, denoted as R(y , y), where y is the predicted output sequence.The objective is to maximize the reward with respect to the agent's action in Eq. 4 and defined as, Compute the reward r t using Eq. 8 using Eq. 1 and Eq. 9 11 Update the parameters θ with gradient descent: θ = θ − α∇J(θ) 12 end 13 Return: θ Reward Shaping To generate the complete target sentence, the agent needs to take actions at each target word, but only one reward is available for all these tens of thousands of actions.This makes RL training inefficient since the same terminal reward is applied to all the intermediate actions.To overcome this issue we adopt reward shaping strategy of Ng et al. (1999).This strategy assigns distinct rewards to each intermediate action taken by the agent.The intermediate reward, denoted as r t (y t , y), for the agent action at decoding step t is computed as: During training, we use cumulative reward n t=1 r t (y t , y) achieved from the decoding step t to update the agent's policy.
Entropy Bonus We add entropy bonus to avoid policy to collapse too quickly.The entropy bonus encourages an agent to take actions more unpredictably, rather than less so.The RL objective function in Eq. 7 becomes, We approximate the gradient ∆ θ h Ĵ2 (θ h ) by using REINFORCE (Williams, 1992)    4 Experimental Results

Datasets
We conduct experiments on WMT 2014 English-German (EN-DE) and English-French (EN-FR) translation tasks.The approximate number of training pairs in EN-DE and EN-FR datasets are 4.5M and 36M respectively; newstest2013 and newstest2014 are used as the dev and test sets.We follow the similar preprocessing steps as described in Vaswani et al. (2017) for both the datasets.We encode the sentences using word-piece vocabulary (Wu et al., 2016) and the shared source-target vocabulary size is set to 32000 tokens.

Implementation Details
We adopt the implementation of the Transformer (Vaswani et al., 2018) with transformer big settings.All the models are trained using 8 NVIDIA Tesla P40 GPUs on a single machine.The BLEU score used in the reward shaping is calculated similarly to Bahdanau et al. (2017); all the n-gram counts start from 1, and the resulting score is mul-tiplied by the length of the target reference sentence.The beam search width (=4) is set empirically based on the dev set performance and λ in Eq. 9 is set to 1e-3 in all the experiments.

Models
We compare the proposed model with the soft-attention based Transformer model (Vaswani et al., 2017).To check whether the performance improvements are coming from the hard-attention mechanism (Eq.4) or from the sequence reward incorporated in the objective function (Eq.7), we compare our work with previously proposed sequence loss based NMT method (Wu et al., 2018).This NMT method is built on top of the Transformer model and trained by combing cross-entropy loss and sequence reward (BLEU score).We also compare our model with the recently proposed Localness Self-Attention network (Yang et al., 2018) which incorporates a localness bias into the Transformer attention distribution to capture useful local context.

Main Results
Table 1 shows the performance of various models on EN-DE and EN-FR translation tasks.These test set case sensitive BLEU scores are obtained using SacreBLEU toolkit1 (Post, 2018).The BLEU score difference between our hard-attention based Transformer model and the original soft-attention based Transformer model indicates the effectiveness of selecting a few relevant source tokens for each target token.The performance gap between our method and sequence loss based Transformer (Wu et al., 2018) shows that the improvements are indeed coming from the hard-attention mechanism.Our approach of incorporating hard-attention into decoder's top selfattention layer to select relevant tokens yielded better results compared to the Localness Self-Attention (Yang et al., 2018) approach of incorporating localness bias only to lower self-attention layers.It can be noted that our model achieved 29.29 and 42.26 BLEU points on EN-DE and EN-FR tasks respectively -surpassing the previously published models.
Analysis To see the effect of the hard-attention mechanism for longer sequences, we group the sequences in the test set based on their length and compute the BLEU score for each group.Table 2 shows the number of sequences present in each group.Figure 2 shows that Transformer with hard attention is more effective in handling the long sequences.Specifically, the performance gap between our model (THA) and the original Transformer model (TSA) grows bigger as sequences become longer.

Related Work
Even though RL based models are difficult to train, in recent years, multiple works (Mnih et al., 2014;Choi et al., 2017;Yu et al., 2017;Narayan et al., 2018;Sathish et al., 2018;Shen et al., 2018) have shown to improve the performance of several natural language processing tasks.Also, it has been used in NMT (Edunov et al., 2018;Wu et al., 2017;Bahdanau et al., 2017) to overcome the inconsistency between the token level objective function and sequence-level evaluation metrics such as BLEU.Our approach is also related to the method proposed by Lei et al. (2016) to explain the decision of text classifier.However, here we focus on selecting a few relevant tokens from a source sequence in a translation task.
Recently, several innovations are proposed on top of the Transformer model to improve performance and training speed.For example, Shaw et al. (2018) incorporated relative positions and Ott et al. (2018) proposed efficient training strategies.These improvements are complementary to the proposed method.Incorporating these techniques will further improve the performance of the proposed method.

Conclusion
In this work, we proposed a hard-attention based NMT model which focuses solely on a few relevant source sequence tokens for each target token to effectively handle long sequence translation.We train our model by designing an RL algorithm with the reward shaping strategy.Our model sets new state-of-the-art results on EN-DE and EN-FR translation tasks.
Figure 1: (a) Overview of hard-attention based Transformer network.(b) Overview of RL agent based hardattention and objective function.

Figure 2 :
Figure 2: Performance of Transformer with Soft-Attention (TSA), and Transformer with Hard Attention (THA) for various sequence lengths on EN-DE and EN-FR translation tasks.

Table 1 :
algorithm which allows us to jointly train J 1 (θ e , θ d ) and Ĵ2 (θ h ).Performance of various models onEN-DE and EN-FR translation tasks.

Table 2 :
Number of sequences present in each group (based on sequence length) of EN-DE and EN-FR testsets.