Target Foresight Based Attention for Neural Machine Translation

In neural machine translation, an attention model is used to identify the aligned source words for a target word (target foresight word) in order to select translation context, but it does not make use of any information about this target foresight word. Previous work proposed improving the attention model by explicitly accessing this target foresight word and demonstrated substantial gains on the alignment task. However, this approach cannot be used for the machine translation task, in which the target foresight word is unavailable. In this paper, we propose a new attention model enhanced by implicit information about the target foresight word, oriented to both alignment and translation tasks. Empirical experiments on Chinese-to-English and Japanese-to-English datasets show that the proposed attention model delivers significant improvements in terms of both alignment error rate and BLEU.


Introduction
Since neural machine translation (NMT) was proposed (Bahdanau et al., 2014), it has attracted increasing interest in the machine translation community (Luong et al., 2015b; Tu et al., 2016; Feng et al., 2016; Cohn et al., 2016). NMT not only yields impressive translation performance in practice, but also has an appealing model architecture. Compared with traditional statistical machine translation (Koehn et al., 2003; Chiang, 2005), one advantage of NMT is that its architecture combines the language model, the translation model and the alignment between source and target words in a unified manner rather than a pipeline manner, and it thereby has the potential to alleviate the issue of error propagation.
In NMT, the attention mechanism plays an important role. It calculates the alignments of a target word with respect to the source words for translation context selection. Although the source words are always available in inference, the target word, called the target foresight word, i.e., the first light-colored word in Figure 1(a), is not yet known, since it is to be translated at the next time step. Therefore, this may lead to inadequate modeling for the attention mechanism (Liu et al., 2016a; Peter et al., 2017). To address this, Peter et al. (2017) explicitly feed this target word into the attention model and demonstrate significant improvements in alignment accuracy. Unfortunately, this approach relies on the premise that the target foresight word is available in advance, as it is in the alignment scenario, and thus it cannot be used in the translation scenario.
* Work done when X. Li was interning at Tencent AI Lab. L. Liu is the corresponding author.
To address this issue, in this paper, we propose a target foresight based attention (TFA) model oriented to both alignment and translation tasks. Its basic idea includes two steps: it first designs an auxiliary mechanism to predict some information about the target foresight word that is helpful for alignment, and then it feeds the predicted result into the attention model for translation. For the sake of efficiency, instead of predicting the target foresight word itself, which has a large vocabulary size, we only predict partial information about it, i.e., its part-of-speech tag, which has been shown to be helpful for word alignment (Liu et al., 2005). Figure 1(b) shows the main idea of TFA based on NMT. In order to mitigate the negative effects of prediction errors, we feed the distribution of the prediction result, instead of the maximum a posteriori result, into the attention model. In addition, since the target foresight words are available during training, we jointly learn the prediction model for the target foresight words and the translation model in a supervised manner.
This paper makes the following contributions:
• It proposes a novel TFA-NMT for neural machine translation by using an auxiliary mechanism to predict the target foresight word, which is subsequently used to enhance the attention model.
• It empirically shows that the proposed TFA-NMT can lead to better alignment accuracy, and achieves significant improvements on both Chinese-to-English and Japanese-to-English translation tasks.

Background
Given a source sentence $x = \{x_1, \ldots, x_m\}$ with length $m$ and a target sentence $y = \{y_1, \ldots, y_n\}$ with length $n$, neural machine translation aims to model the conditional probability $P(y \mid x)$:
\[ P(y \mid x) = \prod_{i=1}^{n} P(y_i \mid y_{<i}, x) \tag{1} \]
where $y_{<i} = \{y_1, \ldots, y_{i-1}\}$ denotes a prefix of $y$ with length $i - 1$.
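As a small illustration of this position-by-position factorization, the following Python sketch scores a sentence by summing per-step log-probabilities. The per-step probabilities are made-up numbers used only for illustration, not outputs of any model.

```python
import math

def sentence_log_prob(step_probs):
    """Return log P(y | x) given the per-step probabilities P(y_i | y_<i, x)."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical per-step probabilities for a three-word target sentence.
step_probs = [0.5, 0.4, 0.25]
log_p = sentence_log_prob(step_probs)

# Exponentiating recovers the product of the per-step probabilities.
sentence_prob = math.exp(log_p)
```

Working in log space is the usual choice because products of many small probabilities underflow quickly.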
To achieve this, neural machine translation adopts recurrent neural networks (RNNs) under the encoder-decoder framework (Bahdanau et al., 2014). In encoding, an encoder reads the source sentence $x$ into a sequence of representation vectors by a bidirectional recurrent neural network. Suppose $h_i$ denotes the representation vector for $x_i$, and let $h = \{h_1, \ldots, h_m\}$. In decoding, a decoder sequentially generates target words according to $P(y_i \mid y_{<i}, x)$ by using another RNN.
In Eq.(1), the distribution $P(y_i \mid y_{<i}, x)$ is used to generate $y_i$ as follows:
\[ P(y_i \mid y_{<i}, x) = \mathrm{softmax}(\phi(y_{i-1}, s_i, c_i)) \tag{2} \]
where $\phi$ represents a feedforward neural network, $c_i$ is the context vector from $h$ to infer $y_i$, and $s_i$ denotes the hidden state at timestamp $i$ via the decoding RNN represented by $f$:
\[ s_i = f(s_{i-1}, y_{i-1}, c_i) \tag{3} \]
Bahdanau et al. (2014) propose an attention model to define the context $c_i$, inspired by the alignment model in statistical machine translation.
Given the last hidden state $s_{i-1}$ and the encoding vectors $h$, an attention model is based on a distribution consisting of $\alpha_{ij}$ as follows:
\[ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{m} \exp(e_{ik})} \]
where $e_{ij}$ is computed by a feedforward neural network represented by $a$:
\[ e_{ij} = a(s_{i-1}, h_j) \tag{4} \]
The quantity $\alpha_{ij}$ denotes the probability that the target word $y_i$ aligns to the source word $x_j$ encoded by $h_j$. According to $\alpha_{ij}$, the context vector $c_i$ is defined as the weighted sum of $h$:
\[ c_i = \sum_{j=1}^{m} \alpha_{ij} h_j \tag{5} \]
In this way, when translating the target word $y_i$, the decoder will pay more attention to its aligned source words with respect to the distribution $\alpha_i = \{\alpha_{i1}, \cdots, \alpha_{im}\}$. Figure 2 shows a slice of the entire architecture for NMT at timestamp $i$.
Unfortunately, even though the entire translation $y$ is available in training, during inference it is not known in advance and must be generated sequentially. Specifically, when calculating $\alpha_i$, one can make use of the information only from $x$ and $y_{<i}$ but nothing from $y_i$. Therefore, it is difficult to determine with certainty which source words should be aligned to an unknown target word $y_i$. This might lead to the inadequacy of the attention model (Liu et al., 2016a; Peter et al., 2017), as illustrated in Figure 1(a).
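The attention step described here can be sketched in a few lines of Python. This is a minimal illustration that assumes the weights are obtained by a softmax over the scores $e_{ij}$, as in Bahdanau et al. (2014); the scores and two-dimensional encodings below are hypothetical values rather than outputs of the scoring network $a$.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, encodings):
    """Turn scores e_i1..e_im into weights alpha_i and a context vector c_i."""
    alpha = softmax(scores)
    dim = len(encodings[0])
    # c_i is the alpha-weighted sum of the encoder vectors h_j.
    context = [sum(a * h[d] for a, h in zip(alpha, encodings)) for d in range(dim)]
    return alpha, context

# Hypothetical scores for m = 3 source words and 2-dimensional encodings h_j.
scores = [2.0, 0.5, 0.5]
encodings = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
alpha, context = attend(scores, encodings)
```

Note that nothing in `attend` depends on the word being generated at step $i$; only the scores, which come from the previous decoder state and the source encodings, determine where the decoder looks.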

Target Foresight Attention
In order to alleviate the issue of inadequate modeling for attention in NMT, in this section, we propose the target foresight attention for NMT, which foresees some related information about the unknown target foresight word to improve its alignments with respect to the source words. The basic idea of the proposed attention model includes the following two steps:
• It first introduces a model to predict some information about the target foresight word. (§3.1)
• It then feeds the predicted result about the target foresight word into the attention model as an additional input. (§3.2)
Therefore, as shown in Figure 1(b), when translating the third word, if the prediction model indicates that it is a "VBZ", the attention model is likely to align it to a verb such as "huí shēng" rather than "rén shù" on the source side, and then the correct word "rises" will be translated.

Target Foresight Prediction
Ideally, one could build a model to directly predict the target foresight word itself. In practice, this would be inefficient due to the large vocabulary size. As a result, we instead build a model to predict partial information about the target foresight word, such as its part-of-speech (POS) tag or word cluster, which has a limited vocabulary size. In this paper, we use the POS tag as the partial information of a target foresight word because POS tags have been shown to be helpful for word alignment (Liu et al., 2005). Furthermore, predicting a POS tag is easier than predicting a target foresight word, so the predicted result will be more reliable for the downstream attention model.
Suppose $u_i$ denotes a variable indicating the POS tag of a target foresight word $y_i$. Our aim is to define a prediction model of $u_i$ prior to calculating the attention probability. For simplicity, this prediction model is generally represented as $\beta_i = P(u_i \mid y_{<i}, x)$. We consider three variant prediction models in a coarse-to-fine manner as follows.

Model 1
It is straightforward to define this prediction model directly on the hidden states of the decoder RNN by using a neural network. Formally, one can use the following equation:
\[ \beta_i = \mathrm{softmax}(\psi(s_{i-1})) \tag{6} \]
where $\psi$ is implemented by a feedforward neural network. Note that Eq.(6) depends only on the decoding RNN hidden state $s_{i-1}$ and is very simple to implement. Figure 3(a) shows its architecture.

Model 2
Unlike Eq.(6), which relies on the same hidden state $s_{i-1}$ as the decoder, we design a specialized RNN to provide a dedicated hidden state for the prediction of $u_i$. This improved prediction model is defined as follows:
\[ \beta_i = \mathrm{softmax}(\psi(t_i)) \tag{7} \]
where $t_i$ is the hidden state of the specialized RNN defined by a GRU unit, i.e., $t_i = g(t_{i-1}, y_{i-1})$. This prediction model architecture is shown in Figure 3(b).

Model 3
In Model 2, the specialized RNN for $u_i$ only considers the target sentence $y$ and ignores the information from the source sentence $x$. We define a fine-grained model by taking a context vector $c'_i$ from $x$ as an additional input:
\[ \beta_i = \mathrm{softmax}(\psi(t_i, c'_i)) \tag{8} \]
where $t_i$ is the hidden state of the specialized RNN. The architecture of this model is shown in Figure 3(c).
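All three variants share the same output step: a feedforward network maps a hidden representation to a distribution $\beta_i$ over POS tags. The sketch below illustrates that step under simplifying assumptions: $\psi$ is replaced by a single linear layer followed by a softmax, and the tag set, weights and input state are hypothetical.

```python
import math

TAGS = ["NN", "VBZ", "IN"]  # hypothetical, tiny tag set

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_tag_distribution(state, weights, bias):
    """Return beta_i = softmax(W * state + b), a one-layer stand-in for psi."""
    logits = [sum(w * s for w, s in zip(row, state)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

# 'state' stands in for s_{i-1} (Model 1) or t_i (Model 2).
state = [0.2, -0.1]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one row per tag
bias = [0.0, 0.0, 0.0]
beta = predict_tag_distribution(state, weights, bias)
```

Under the same assumptions, Model 3 would differ only in its input: the state would be extended with the extra source context vector before the linear layer is applied.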

Feeding the Prediction Model
Suppose we have the prediction result $P(u_i \mid y_{<i}, x)$; we then consider how to feed it into the attention model. First, it is natural to feed the prediction into attention by using a maximum a posteriori (MAP) strategy:
\[ e_{ij} = a(s_{i-1}, h_j, z_i) \tag{9} \]
where $a$ is the function for attention similar to Eq.(4) but includes an additional input $z_i$, which is the MAP result of $P(u_i \mid y_{<i}, x)$:
\[ z_i = z\left(\arg\max_{u_i} P(u_i \mid y_{<i}, x)\right) \tag{10} \]
where $z$ denotes the embeddings of the POS tags of target foresight words, and $z(u_i)$ returns the embedding of a particular POS tag $u_i$. Note that in Eq.(10) the accuracy of $P(u_i \mid y_{<i}, x)$ is important to the attention model. For example, suppose at timestamp $i$ the ground-truth POS tag is "NN", but one has $P(u_i = \mathrm{NN} \mid y_{<i}, x) = 0.4$ and $P(u_i = \mathrm{VV} \mid y_{<i}, x) = 0.41$. In this case, the prediction model selects "VV" as the POS tag of the target foresight word and ignores the ground-truth tag "NN". The attention model then takes this erroneous signal and may align the target foresight word to a verb, which in turn might lead to a translation error. Therefore, we propose another method that integrates the expected embedding of $u_i$ according to $P(u_i \mid y_{<i}, x)$ into attention as follows:
\[ z_i = \sum_{u_i} P(u_i \mid y_{<i}, x) \, z(u_i) \tag{11} \]
In this way, $z_i$ can take into account all possible POS tags $u_i$, including the ground-truth result.
(Footnote 2: In our preliminary experiments, we tried $c_i$, but we found that $c'_i$ performs better.)
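The "NN" versus "VV" example can be made concrete with a short Python sketch that contrasts the two feeding strategies. The probabilities 0.40 and 0.41 come from the example in the text; the remaining mass on a catch-all "OTHER" tag and the two-dimensional tag embeddings are assumptions made purely for illustration.

```python
def map_embedding(beta, embeddings):
    """MAP feeding: return the embedding of the single most probable tag."""
    best_tag = max(beta, key=beta.get)
    return list(embeddings[best_tag])

def expected_embedding(beta, embeddings):
    """Expectation feeding: probability-weighted average of all tag embeddings."""
    dim = len(next(iter(embeddings.values())))
    return [sum(beta[tag] * embeddings[tag][d] for tag in beta) for d in range(dim)]

# Predicted tag distribution; "OTHER" absorbs the remaining mass (assumption).
beta = {"NN": 0.40, "VV": 0.41, "OTHER": 0.19}
# Hypothetical 2-dimensional tag embeddings.
embeddings = {"NN": [1.0, 0.0], "VV": [0.0, 1.0], "OTHER": [0.0, 0.0]}

z_map = map_embedding(beta, embeddings)       # commits entirely to "VV"
z_exp = expected_embedding(beta, embeddings)  # keeps the "NN" component
```

With MAP feeding, the near-tie is resolved in favour of "VV" and the evidence for "NN" is discarded. With expectation feeding, both tags contribute almost equally, so a small prediction error does not flip the signal seen by the attention model.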
We thus obtain the entire architecture of the proposed target foresight attention based NMT (TFA-NMT), as shown in Figure 4. Comparing Figure 4 with Figure 2, the only difference is the variable $z_i$, which is obtained from Eq.(10-11) and the prediction model as shown in Figure 3.
Note that the proposed TFA-NMT models the target foresight word, which is a future word with respect to the current time step, in order to compute attention. In this sense, it employs the idea of modeling the future and thus resembles the work in [...]. The main difference is that TFA-NMT models the future from the target side whereas [...] models the future from the source side. In addition, Weng et al. (2017) impose a regularization term by using future words during training. Unlike our approach, their approach does not use future words during inference because these words are unavailable. Nevertheless, it is possible to combine their approach with ours for further improvements.

Learning and Inference
Suppose a set of training data is denoted by $\{\langle x^k, y^k, u^k \rangle \mid k = 1, \cdots, K\}$. Here $x^k$, $y^k$ and $u^k$ denote a source sentence, a target sentence and the POS tag sequence of $y^k$, respectively. Then one can jointly train both the translation model for $y^k$ and the prediction model for $u^k$ by minimizing the loss function:
\[ \mathcal{L} = -\sum_{k=1}^{K} \sum_{i} \left[ \log P(y^k_i \mid y^k_{<i}, x^k) + \lambda \log P(u^k_i \mid y^k_{<i}, x^k) \right] \tag{12} \]
where $P(y^k_i \mid y^k_{<i}, x^k)$ is the translation model similar to Eq.(2) with target foresight attention, and $P(u^k_i \mid y^k_{<i}, x^k)$ is the target foresight prediction model as defined in Eq.(6-8). $\lambda \geq 0$ is a hyper-parameter that balances the preference between the translation model and the target foresight prediction model.
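The joint objective can be illustrated for a single sentence with the Python sketch below. It assumes the loss is the negative log-likelihood of the target words plus $\lambda$ times the negative log-likelihood of the foresight tags, which is one natural reading of the description above; the probabilities are hypothetical values assigned to the reference words and tags.

```python
import math

def joint_loss(word_probs, tag_probs, lam=1.0):
    """Word NLL plus lam times foresight-tag NLL for one sentence."""
    word_nll = -sum(math.log(p) for p in word_probs)
    tag_nll = -sum(math.log(p) for p in tag_probs)
    return word_nll + lam * tag_nll

# Hypothetical model probabilities of the reference words and reference tags.
word_probs = [0.5, 0.25]
tag_probs = [0.8, 0.5]

loss_joint = joint_loss(word_probs, tag_probs, lam=1.0)
loss_translation_only = joint_loss(word_probs, tag_probs, lam=0.0)
```

Setting `lam=0.0` removes the supervision on the foresight tags, so the prediction model is then trained only indirectly through the translation loss.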
In terms of its training objective, the proposed TFA-NMT resembles multi-task learning, since it jointly learns two tasks, similar to (Evgeniou and Pontil, 2004; Luong et al., 2015a). The difference is clear: in this work the prediction result of one model is integrated into the other model, whereas in their work the two models only share some common hidden states.
In inference, we implement two decoding methods, corresponding to the two ways of integrating the foresight prediction model into attention described in §3.2. For the MAP feeding style, we use beam search to optimize $u_i$ in addition to $y_i$ according to the loss function in Eq.(12). For the expectation feeding style, in contrast, we keep the standard beam search algorithm, which considers only the translation model, i.e., we set $\lambda = 0$.

Experiments
We conduct experiments on Chinese-to-English and Japanese-to-English translation tasks. The detailed analyses are based on the Chinese-to-English task, and the generalization ability is shown on the Japanese-to-English task. Case-insensitive 4-gram BLEU is used to evaluate translation quality, and multi-bleu.perl is adopted as its implementation.

Setup
Data. The training data for the Chinese-to-English task consists of 1.8M sentence pairs from the NIST2008 Open Machine Translation Campaign, with 40.1M Chinese words and 48.3M English words, respectively. The development set is NIST2002 (878 sentences) and the test sets are NIST2005 (1082 sentences), NIST2006 (1664 sentences), and NIST2008 (1357 sentences).
For Japanese-to-English translation, we adopt the data sets from the NTCIR-9 patent translation task (Goto et al., 2013) [...]erence, following (Goto et al., 2013; Liu et al., 2016b) for further comparison.
Table 1: Speeds and performances of the proposed models. "Speed" is measured in words/second for both training and decoding, and performances are measured in terms of BLEU scores ("BLEU") and foresight prediction accuracy ("FPA") on the development set. Higher BLEU and FPA scores denote better performance.
Implementation. We compare the proposed models with two strong baselines from SMT and NMT:
• Moses (Koehn et al., 2007): an open source phrase-based translation system with default configuration.
We implement the proposed models on top of Nematus. We use the Stanford Log-linear Part-Of-Speech Tagger (Toutanova et al., 2003) to produce POS tags for the English side. For both Chinese-to-English and Japanese-to-English tasks, we limit the vocabularies to the most frequent 30K words on both sides. All out-of-vocabulary words are mapped to a special token "UNK". Only sentences of up to 50 words are used in training, with 80 sentences in a batch. The dimension of the word embedding is 620. The dimensions of both the feedforward NN and the RNN hidden layer are 1000. The beam size for decoding is 12, and the cost function is optimized by Adadelta with the hyper-parameters suggested by Zeiler (2012). For TFA-NMT in particular, the dimension of the foresight embedding is also 620, and the hyper-parameter λ is 1.
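The vocabulary truncation, "UNK" mapping and length filtering described above can be sketched as follows. This is a simplified, hypothetical illustration on one side of a toy corpus; it is not the Nematus preprocessing code, and the function names are our own. The settings in the text would correspond to `max_size=30000` and `max_len=50`, applied to sentence pairs.

```python
from collections import Counter

def build_vocab(sentences, max_size):
    """Keep the max_size most frequent words of a tokenized corpus."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, _ in counts.most_common(max_size)}

def preprocess(sentences, vocab, max_len=50, unk="UNK"):
    """Drop sentences longer than max_len and map out-of-vocabulary words to unk."""
    kept = [sent for sent in sentences if len(sent) <= max_len]
    return [[word if word in vocab else unk for word in sent] for sent in kept]

# Toy corpus with tiny limits so the effect is visible.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "away"],
]
vocab = build_vocab(corpus, max_size=3)
processed = preprocess(corpus, vocab, max_len=3)
```

Here the four-word sentence is dropped by the length filter, and "dog" falls outside the three-word vocabulary and is replaced by "UNK".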

Impact of Components
We conduct analyses on Chinese-to-English translation task, to investigate the impact of the added components and to figure out their best configuration for further testing in the next subsection. Table 1 lists the speeds and performances of the proposed models. Clearly the proposed approach improves the translation quality in all cases, although there are still considerable differences among the proposed variants.

Model Complexity
The proposed models add parameters to the NMT baseline system Nematus, which has 105M parameters. The most complex model (i.e., Model3) introduces 27M new parameters, which is small compared with the baseline model. As seen in Table 1, the proposed models significantly slow down training, which we attribute to the new softmax operation over the foresight tags and the additional gradient operations associated with the new training objective, i.e., Eq.(12). For decoding, the most complex model reduces speed by around 30%, which is the cost of the proposed approach for improving translation quality.
Performance. We measure performance with BLEU, and the results are shown in Table 1. Model1 marginally improves performance by guiding the decoder states to embed information for predicting foresight tags. Model2 achieves further improvement by introducing a new dedicated hidden layer that explicitly separates the prediction function from the decoder states. Model3 achieves the best performance by adopting an independent attention model to attend to the source parts relevant for foresight prediction, which may not be the same as the source parts attended to for translation. We conduct significance tests using Kevin Gimpel's toolkit (Clark et al., 2011). We found that Model1 is not significantly better than the baseline, but Model2 is significantly better with p<0.05 and Model3 is significantly better with p<0.01. Given that simply introducing an additional layer ("+2-Layer") does not produce any improvement on this data, we believe the gain of our model does not come only from the additional parameters. In addition, when we augment the word embedding by concatenating the POS tag embedding, as proposed by Sennrich and Haddow (2016), the BLEU score is 38.96, which indicates that the improvement of our model does not come only from the POS tags themselves. In order to further validate the improvements of the proposed variants, we evaluate the foresight prediction accuracy (FPA) of the three proposed prediction models. We found that the fine-grained Model3 achieves the best FPA, indicating that a well-estimated foresight is very important for obtaining gains in terms of BLEU.
Table 2: Performances on syntactic categories. "Base" denotes "Nematus", and "Ours" denotes the proposed model.

Analysis on Syntactic Categories
In this experiment, we investigate which categories of generated words benefit most from the proposed approach in terms of alignments, measured by alignment error rate (AER) (Och, 2003). We carry out experiments on the evaluation dataset from (Liu and Sun, 2015), which contains 900 manually aligned Chinese-English sentence pairs. Following (Luong et al., 2015b), we force-decode the bilingual sentence pairs, consisting of source and reference sentences, to obtain the attention matrices, and then we extract one-to-one alignments by picking the source word with the highest alignment confidence as the hard alignment. As shown in Table 2, the AER improvements are modest for content words such as nouns, verbs, and adjectives ("Adj."), but there are substantial improvements for function words such as prepositions ("Prep.") and punctuation ("Punc.").
Table 3: Effect of foresight supervision signal in training (i.e., λ) and foresight representations in decoding: Exp for expectation and Map for maximum a posteriori.
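The alignment extraction and scoring procedure can be sketched as follows. The sketch picks, for each target word, the source word with the highest attention weight, and then computes AER with the commonly used definition over sure and possible gold links, $1 - (|A \cap S| + |A \cap P|) / (|A| + |S|)$. That formula is an assumption here, since the text does not spell it out, and the attention matrix and gold links are hypothetical.

```python
def hard_alignments(attention):
    """attention[i][j] is the weight of target word i on source word j.

    Returns a set of (source_index, target_index) links, one per target word.
    """
    links = set()
    for i, row in enumerate(attention):
        j = max(range(len(row)), key=lambda k: row[k])
        links.add((j, i))
    return links

def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), with S treated as part of P."""
    possible = possible | sure
    hits = len(predicted & sure) + len(predicted & possible)
    return 1.0 - hits / (len(predicted) + len(sure))

# Hypothetical attention matrix: 3 target words (rows) x 3 source words (columns).
attention = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
predicted = hard_alignments(attention)

# Hypothetical gold annotation as (source_index, target_index) links.
sure = {(0, 0), (1, 1), (1, 2)}
possible = {(2, 1)}
aer = alignment_error_rate(predicted, sure, possible)
```

Lower AER is better. Restricting the predicted and gold links to target words of one syntactic category would give a per-category breakdown.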
The reason can be explained as follows. The content words are easy to align, with AER under 38 as shown in Table 2, and thus it is more difficult to gain over Base. On the other hand, as shown in Table 2, function words are inherently more difficult to align than content words. These findings agree with linguistic intuition: content words tend to be less involved in multiple potential correspondences than function words, and function words tend to be attached to content words, as pointed out by Pianta and Bentivogli (2004). Fortunately, TFA-NMT can predict the POS tag of the target foresight word with high confidence, and thus it can improve the alignment quality by using POS tags, which are useful for the alignment task (Liu et al., 2005).
It is surprising that the AER for "Prep.", "Det." and "Punc." is relatively low, especially for Base. The main reason can be explained in terms of the quantities $y_{i-1}$, $s_i$, and $c_i$ in Eq.(2) as follows. These highly frequent function words are usually easy to translate by using the history information from $y_{i-1}$ and $s_i$, even if $c_i$ is not reliable enough. For example, it is relatively easy to guess a comma from the history words in a language modeling task, where there is no bilingual information at all. Therefore, during training, the model tends to adjust the parameters for highly frequent words based on $y_{i-1}$ and $s_i$ while neglecting the attention model.

Foresight Strategies
Table 3 shows the performances of different foresight strategies in both training and decoding. Without an explicit objective to guide the training of the foresight prediction model (i.e., λ = 0), the performance decreases by 1.27 BLEU points. When feeding the best predicted foresight result to the attention model (i.e., Map), the performance decreases by 0.29 BLEU points. We attribute this to the propagation of prediction errors, which can be alleviated by using a weighted representation of all predicted results (i.e., Exp).
In the following experiments, we use "λ = 1 and Exp" as the default setting for the final system TFA-NMT.
Table 4 shows the translation performances for the Chinese-to-English translation task. As seen, the proposed approach significantly outperforms the baseline system (i.e., Nematus) in all cases, demonstrating the effectiveness and universality of our model.
Table 5 shows the translation quality of the NMT baseline and our TFA-NMT on the Japanese-to-English task. From the table, we can see that our model still achieves significant improvements of 1.22 and 1.31 BLEU points on the development and test sets, respectively. This shows that the proposed approach works well across different language pairs.

Related Work
The attention model has become a standard component in many applications due to its ability to dynamically select the informative context from sequential representations. For example, Xu et al. (2015) propose an attention based neural network for the image captioning task and advance the state-of-the-art results; Yin et al. (2015) place the attention structure between a pair of convolutional networks for answer selection, paraphrase identification and textual entailment tasks. In the context of machine translation, the idea of attention based neural networks was pioneered by Bahdanau et al. (2014) and Luong et al. (2015b), and it achieved impressive results over traditional statistical machine translation. Since then, much research has been devoted to improving neural machine translation by enhancing attention models. Tu et al. (2016) [...] introduce syntactic knowledge into attention models. These works are essentially similar to the proposed approach, since we introduce auxiliary information from a target foresight word into the attention model. However, there is a significant difference between our approach and theirs. Our auxiliary information is biased towards the word to be translated at the next time step, while theirs is biased towards the information available so far at the current time step, and thereby our approach is orthogonal to theirs.
The works mentioned above improve the attention models by accessing auxiliary information, and thus they modify the structure of attention models in both inference and learning. In contrast, [...]; Liu et al. (2016b); Chen et al. (2016) maintain the structure of the attention models in inference but use external signals to supervise the outputs of attention models during learning. They improve the generalization ability of attention models by using external aligners as the signals, which typically yield alignment results accurate enough to guide the learning of attention.

Conclusion
It has been argued that the traditional attention model in neural machine translation suffers from model inadequacy due to the lack of information from the target foresight word (Peter et al., 2017; Liu et al., 2016a). To address this issue, this paper proposes a new attention model, which can serve both alignment and translation tasks, by implicitly making use of the target foresight word. Empirical experiments on Chinese-to-English and Japanese-to-English tasks demonstrate that the proposed attention based NMT delivers substantial gains in terms of both BLEU and AER scores. In future work, it is promising to exploit other target foresight information, such as word clusters, in addition to the POS tags used in this paper, and it would also be interesting to apply this idea on top of other attention models such as the local attention in Luong et al. (2015b).