Supervised Attentions for Neural Machine Translation

In this paper, we improve the attention (or alignment) accuracy of neural machine translation by utilizing the alignments of the training sentence pairs. We simply compute the distance between the machine attentions and the "true" alignments, and minimize this cost in the training procedure. Our experiments on a large-scale Chinese-to-English task show that our model improves both translation and alignment quality significantly over a large-vocabulary neural machine translation system, and even beats a state-of-the-art traditional syntax-based system.

The attention model plays a crucial role in NMT, as it shows which source word(s) the model should focus on in order to predict the next target word. However, the attention or alignment quality of NMT is still very low (Mi et al., 2016a; Tu et al., 2016).
In this paper, we alleviate the above issue by utilizing the alignments (human-annotated data or machine alignments) of the training set. Given the alignments of all the training sentence pairs, we add an alignment distance cost to the objective function: we not only maximize the log translation probabilities, but also minimize the alignment distance cost. Large-scale Chinese-to-English experiments on various test sets show that our best single system improves translation quality significantly over the large-vocabulary NMT system (Section 5) and beats the state-of-the-art syntax-based system.

Neural Machine Translation
As shown in Figure 1, attention-based NMT (Bahdanau et al., 2014) is an encoder-decoder network. The encoder employs a bi-directional recurrent neural network to encode the source sentence x = (x_1, ..., x_l), where l is the sentence length (including the end-of-sentence token ⟨eos⟩), into a sequence of hidden states h = (h_1, ..., h_l); each h_i is a concatenation of a left-to-right →h_i and a right-to-left ←h_i.
Figure 1: The architecture of attention-based NMT (Bahdanau et al., 2014). The source sentence is x = (x_1, ..., x_l) with length l; x_l is the end-of-sentence token ⟨eos⟩ on the source side. The reference translation is y* = (y*_1, ..., y*_m) with length m; similarly, y*_m is the target-side ⟨eos⟩. ←h_i and →h_i are the bi-directional encoder states. α_{t,j} is the attention probability at time t, position j. H_t is the weighted sum of encoder states, s_t is a hidden state, and o_t is an output state. Another one-layer neural network projects o_t to the target output vocabulary and applies a softmax to predict the probability distribution over the output vocabulary. The attention model (the right box) is a two-layer feed-forward neural network: A_{t,j} is an intermediate state, another layer converts it into a real number e_{t,j}, and the final attention probability at position j is α_{t,j}.
Given h, the decoder predicts the target translation by maximizing the conditional log-probability of the correct translation y* = (y*_1, ..., y*_m), where m is the sentence length (including the end-of-sentence token). At each time step t, the probability of each word y_t from a target vocabulary V_y is
p(y_t | x, y*_1, ..., y*_{t-1}) = g(s_t, y*_{t-1}),    (1)
where g is a two-layer feed-forward neural network over the embedding of the previous word y*_{t-1} and the hidden state s_t. The state s_t is computed as
s_t = q(s_{t-1}, y*_{t-1}, H_t),    (2)
where q is a gated recurrent unit and H_t is a weighted sum of h:
H_t = Σ_{i=1}^{l} α_{t,i} · h_i.    (3)
The weights α are computed with a two-layer feed-forward neural network r:
e_{t,i} = r(s_{t-1}, h_i),    α_{t,i} = exp(e_{t,i}) / Σ_{j=1}^{l} exp(e_{t,j}).    (4)
If we put all α_{t,i} (t = 1...m, i = 1...l) into a matrix A, we obtain an alignment matrix like (c) in Figure 2, where each row (for each target word) is a probability distribution over the source sentence x.
The training objective is to maximize the conditional log-probability of the correct translation y* given x with respect to the parameters θ:
θ* = argmax_θ Σ_{n=1}^{N} Σ_{t=1}^{m} log p(y*^n_t | x^n, y*^n_{t-1}, ..., y*^n_1),    (5)
where n indexes the n-th sentence pair (x^n, y*^n) in the training set and N is the total number of pairs.
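As a minimal illustration of Eqs. 3 and 4 (not the exact parameterization of our system), the following numpy sketch computes one row of the attention matrix A; the weight names (W_s, W_h, v) and the toy dimensions are assumptions for illustration only.

```python
import numpy as np

def attention_weights(s_prev, h, W_s, W_h, v):
    """Sketch of the attention model (Eq. 4): a two-layer feed-forward
    network scores every source state h_i against the previous decoder
    state s_{t-1}, and a softmax over source positions yields alpha_{t,i}."""
    hidden = np.tanh(s_prev @ W_s + h @ W_h)     # intermediate states, shape (l, d_att)
    e = hidden @ v                               # scores e_{t,1..l}, shape (l,)
    e = e - e.max()                              # numerical stability
    return np.exp(e) / np.exp(e).sum()           # softmax over the source sentence

# Toy example (dimensions and weights are illustrative only).
rng = np.random.default_rng(0)
l, d_enc, d_dec, d_att = 6, 8, 8, 5
h = rng.normal(size=(l, d_enc))                  # encoder states h_1..h_l
s_prev = rng.normal(size=(d_dec,))               # previous decoder state s_{t-1}
W_s = rng.normal(size=(d_dec, d_att))
W_h = rng.normal(size=(d_enc, d_att))
v = rng.normal(size=(d_att,))

alpha_t = attention_weights(s_prev, h, W_s, W_h, v)
H_t = alpha_t @ h                                # Eq. 3: weighted sum of encoder states
print(alpha_t.round(3), alpha_t.sum())           # one row of A; probabilities sum to 1
```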

Alignment Component
The attentions α_{t,1}, ..., α_{t,l} at each step t play an important role in NMT. However, their accuracy is still far behind the traditional MaxEnt alignment model in terms of alignment F1 score (Mi et al., 2016b; Tu et al., 2016). Thus, in this section, we explicitly add an alignment distance to the objective function in Eq. 5. The "true" alignments for each sentence pair can come from human-annotated data, or from unsupervised or supervised aligners (e.g., GIZA++ (Och and Ney, 2000) or MaxEnt (Ittycheriah and Roukos, 2005)).
Given an alignment matrix A for a sentence pair (x, y), shown in Figure 2 (a), where x_l is the end-of-source-sentence token ⟨eos⟩, we align all unaligned target words (y*_3 in this example) to ⟨eos⟩, and we force y*_m (the end-of-target-sentence token) to be aligned to x_l with probability one. Then we conduct two transformations to get the probability distribution matrices ((b) and (c) in Figure 2).
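The following sketch illustrates this preprocessing step; the link format (0-based (target, source) pairs) and the function name are assumptions for illustration only, not our actual code.

```python
import numpy as np

def build_alignment_matrix(links, m, l):
    """Build an m x l 0/1 alignment matrix A. `links` is a set of (t, i)
    pairs (0-based) meaning target word t is aligned to source word i;
    position l-1 is assumed to be the source <eos>."""
    A = np.zeros((m, l))
    for t, i in links:
        A[t, i] = 1.0
    for t in range(m):
        if A[t].sum() == 0:        # unaligned target words go to <eos>
            A[t, l - 1] = 1.0
    A[m - 1, :] = 0.0              # the target <eos> is aligned to the
    A[m - 1, l - 1] = 1.0          # source <eos> with probability one
    return A
```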

Simple Transformation
The first transformation simply normalizes each row. Figure 2 (b) shows the resulting matrix A*. The last column in red dashed lines shows the alignments of the special end-of-sentence token ⟨eos⟩.
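A minimal sketch of this row normalization, continuing the illustrative matrix built in the previous sketch:

```python
def simple_transformation(A):
    """Normalize each row of the 0/1 alignment matrix so that every target
    word gets a probability distribution over source positions (matrix A*
    in Figure 2 (b)). Rows are never empty here, because unaligned target
    words were already attached to <eos>."""
    return A / A.sum(axis=1, keepdims=True)
```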

Objectives
Alignment Objective: Given the "true" alignments A* and the machine attentions A produced by the NMT model, we compute the (squared) Euclidean distance between A* and A:
d(A, A*) = Σ_{t=1}^{m} Σ_{i=1}^{l} (α_{t,i} − α*_{t,i})^2.    (6)
NMT Objective: Plugging Eq. 6 into Eq. 5, we have
θ* = argmax_θ Σ_{n=1}^{N} ( Σ_{t=1}^{m} log p(y*^n_t | x^n, y*^n_{t-1}, ..., y*^n_1) − d(A^n, A*^n) ).    (7)
There are two parts, translation and alignment, so we can optimize them jointly, or separately (e.g., first optimize alignment only, then optimize translation). Thus, we divide the network in Figure 1 into an alignment part A and a translation part T:
• A: all networks before the hidden state s_t,
• T: the network g(s_t, y*_{t−1}).
If we only optimize A, we keep the parameters in T unchanged. We can also optimize them jointly (J). In our experiments, we test different optimization strategies.
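As a rough sketch, the combined objective of Eqs. 6 and 7 can be written as a loss to minimize; the helper names and the align_weight knob are illustrative assumptions, not parameters of the model described above.

```python
import numpy as np

def alignment_distance(A_attn, A_true):
    """Eq. 6: distance between the model attentions A and the 'true'
    alignment distributions A* (sum of squared differences)."""
    return float(((A_attn - A_true) ** 2).sum())

def joint_loss(log_probs, A_attn, A_true, align_weight=1.0):
    """Eq. 7 expressed as a loss: negative log-likelihood of the reference
    words plus the alignment distance. align_weight is an illustrative
    knob, not a parameter taken from the text above."""
    nll = -float(np.sum(log_probs))          # -sum_t log p(y*_t | x, y*_<t)
    return nll + align_weight * alignment_distance(A_attn, A_true)
```

Optimizing A only (or T only) then amounts to letting only the corresponding subset of parameters receive updates from this loss while the other part stays frozen.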

Related Work
In order to improve the attention or alignment accuracy, Cheng et al. (2016) adapted agreement-based learning (Liang et al., 2006; Liang et al., 2008), and introduced a combined objective that takes into account both translation directions (source-to-target and target-to-source) and an agreement term between the two alignment directions.
By contrast, our approach directly uses and optimizes NMT parameters using the "supervised" alignments.

Data Preparation
We run our experiments on a Chinese-to-English task. The training corpus consists of approximately 5 million sentences available within the DARPA BOLT Chinese-English task. The corpus includes a mix of newswire, broadcast news, and web blogs. We do not include the HK Law, HK Hansard, and UN data. The Chinese text is segmented with a segmenter trained on CTB data using conditional random fields (CRF). Our development set is the concatenation of several tuning sets (GALE Dev, P1R6 Dev, and Dev 12) initially released under the DARPA GALE program, 4491 sentences in total. Our test sets are NIST MT06 (1664 sentences), MT08 news (691 sentences), and MT08 web (666 sentences).
For all NMT systems, the full vocabulary size of the training set is 300k. In the training procedure, we use AdaDelta (Zeiler, 2012) to update model parameters with a mini-batch size of 80. Following Mi et al. (2016a), the output vocabulary for each mini-batch or sentence is a subset of the full vocabulary. For each source sentence, the sentence-level target vocabulary is the union of the top 2k most frequent target words and the top 10 candidates of the word-to-word/phrase translation tables learned from 'fast align' (Dyer et al., 2013). The maximum length of a source phrase is 4. At training time, we also add the reference in order to make the translation reachable.
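For concreteness, here is a small sketch of how such a sentence-level target vocabulary could be assembled; the dictionary-based table format and all names are illustrative assumptions, not our actual implementation (at training time the reference words would be added to this set as well).

```python
def sentence_level_vocab(src_words, top2k, translation_table,
                         n_candidates=10, max_phrase_len=4):
    """Union of the 2k most frequent target words and the top candidates of
    every source word/phrase (up to length 4) appearing in the sentence.
    translation_table maps a source word/phrase to a ranked candidate list."""
    vocab = set(top2k)
    for n in range(1, max_phrase_len + 1):
        for i in range(len(src_words) - n + 1):
            phrase = " ".join(src_words[i:i + n])
            vocab.update(translation_table.get(phrase, [])[:n_candidates])
    return vocab
```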
The Cov. LVNMT system is a re-implementation of the enhanced NMT system of Mi et al. (2016a), which employs a coverage embedding model and achieves better performance than the large-vocabulary NMT of Jean et al. (2015). The coverage embedding dimension of each source word is 100.
Following Jean et al. (2015), we dump the alignments (attentions) for each sentence, and replace UNKs with the word-to-word translation model or the aligned source word.
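A minimal sketch of this UNK-replacement step, assuming the dumped attentions form an m x l matrix and w2w_table maps a source word to its best single-word translation (both names are illustrative):

```python
def replace_unks(output_words, attentions, src_words, w2w_table, unk="<unk>"):
    """For each UNK in the output, take the most-attended source word and
    substitute its word-to-word translation, falling back to copying the
    source word itself."""
    fixed = []
    for t, word in enumerate(output_words):
        if word == unk:
            src = src_words[int(attentions[t].argmax())]
            fixed.append(w2w_table.get(src, src))
        else:
            fixed.append(word)
    return fixed
```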
Our SMT system is a hybrid syntax-based tree-to-string model (Zhao and Al-onaizan, 2008), a simplified version of joint decoding (Liu et al., 2009; Cmejrek et al., 2013). We parse the Chinese side with the Berkeley parser, align the bilingual sentences with GIZA++ and MaxEnt, and extract Hiero and tree-to-string rules from the training set. Our two 5-gram language models are trained on the English side of the parallel corpus and on monolingual corpora (around 10 billion words from Gigaword, LDC2011T07), respectively. As suggested by Zhang (2016), NMT systems can achieve better results with the help of such monolingual corpora; in this paper, our NMT systems only use the bilingual data. We tune our system with PRO (Hopkins and May, 2011) to minimize (TER-BLEU)/2 on the development set.
Table 1: Single system results in terms of (TER-BLEU)/2 (T-B, the lower the better) on the 5 million sentence Chinese-to-English training set. BP denotes the brevity penalty. NMT results are on a large vocabulary (300k) and with UNKs replaced. The second column shows different alignments: Zh → En (one direction), GDFA ("grow-diag-final-and"), and MaxEnt (Ittycheriah and Roukos, 2005). A, T, and J mean optimizing alignment only, translation only, and jointly. Gau. denotes the smoothed transformation.
Table 1 shows the translation results of all systems. The syntax-based statistical machine translation model achieves an average (TER-BLEU)/2 of 13.36 on the three test sets. The Cov. LVNMT system achieves an average (TER-BLEU)/2 of 14.24, which is about 0.9 points worse than the tree-to-string SMT system. Please note that all systems are single systems; it is highly possible that an ensemble of NMT systems with different random seeds can lead to better results than SMT.
The alignment quality of the training alignments improves from Zh → En to MaxEnt. We also test different optimization strategies: J (jointly), A (alignment only), and T (translation model only). A combination, A → T, means that we first optimize A only, then fix A and update only the T part. Gau. denotes the smoothed transformation (Section 3.2); only the last row uses the smoothed transformation, all others use the simple transformation.
Experimental results in Table 1 show several interesting trends. First, with the same alignment, joint optimization (J) works better than the other optimization strategies (lines 3 to 6). Unfortunately, breaking the network down into two separate parts (A and T) and optimizing them separately does not help (lines 3 to 5); we have to conduct joint optimization (J) in order to get a comparable or better result over the baseline system (lines 3, 5, and 6).
Second, when we change the training alignment seeds (Zh → En, GDFA, and MaxEnt), the NMT model does not yield significantly different results (lines 6 to 8).
Third, the smoothed transformation (J + Gau.) gives some improvement over the simple transformation (the last two lines), and achieves the best result (1.2 points better than LVNMT, and 0.3 points better than tree-to-string). In terms of BLEU scores, we conduct statistical significance tests with the sign-test of Collins et al. (2005); the results show that the improvements of our J + Gau. system over LVNMT are significant on all three test sets (p < 0.01).
Finally, the brevity penalty (BP) consistently improves after we add the alignment cost to the NMT objective: the alignment objective adjusts the translation length to be more in line with the human references.

Alignment Results
We further boost the alignment F1 score to 50.97 by tuning alignment and translation jointly (J in line 7). Interestingly, the system using MaxEnt alignments produces more alignment links in the output, which results in a higher recall. This suggests that using MaxEnt leads to a sharper attention distribution: since we pick the alignment links based on the attention probabilities, the sharper the distribution, the more links we can pick. We believe that a sharp attention distribution is a desirable property for NMT. Again, the best result is J + Gau. in the last row, which significantly improves the F1 score by 5 points over the baseline Cov. LVNMT system. When we use MaxEnt alignments, J + Gau. smoothing gives us about a 1.7-point gain over the J system, so it would also be interesting to apply J + Gau. to the GDFA alignments.
Together with the results in Table 1, we conclude that adding the alignment cost to the training objective helps both translation and alignment significantly.
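Since the alignment links are picked from the attention probabilities, a simple thresholding sketch illustrates why a sharper distribution yields more links and thus higher recall; the rule and the 0.5 cutoff are assumptions for illustration, not necessarily our exact procedure.

```python
def extract_links(attentions, threshold=0.5):
    """Keep a link (t, i) whenever the attention probability alpha_{t,i}
    exceeds the threshold; a peaked (sharp) distribution pushes more mass
    above the cutoff, so more links are extracted."""
    return [(t, i)
            for t, row in enumerate(attentions)
            for i, p in enumerate(row)
            if p > threshold]
```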

Conclusion
In this paper, we utilize the "supervised" alignments and add the alignment cost to the NMT objective function. In this way, we directly optimize the attention model in a supervised way. Experiments show significant improvements in both translation and alignment tasks over a very strong LVNMT system.