Adaptive Multi-pass Decoder for Neural Machine Translation

Although end-to-end neural machine translation (NMT) has achieved remarkable progress in recent years, the idea of adopting a multi-pass decoding mechanism in conventional NMT has not been well explored. In this paper, we propose a novel architecture called the adaptive multi-pass decoder, which introduces a flexible multi-pass polishing mechanism to extend the capacity of NMT via reinforcement learning. More specifically, we adopt an extra policy network to automatically choose a suitable and effective number of decoding passes, according to the complexity of source sentences and the quality of the generated translations. Extensive experiments on Chinese-English translation demonstrate the effectiveness of our proposed adaptive multi-pass decoder over conventional NMT, with a significant improvement of about 1.55 BLEU.


Introduction
In the past several years, end-to-end neural machine translation (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014) has attracted increasing attention from both the academic and industry communities. Compared with conventional statistical machine translation (SMT) (Brown et al., 1993; Koehn et al., 2003), which needs to explicitly model latent structures, NMT adopts a unified encoder-decoder framework to directly transform a source sentence into a target sentence. Furthermore, the introduction of the attention mechanism enhances the capability of NMT in capturing long-distance dependencies.
Recently, a number of authors have endeavored to adopt the polishing mechanism into NMT. Similar to the human cognitive process for writing a good paper, their models first create a complete draft and then polish it based on a global understanding of the whole draft (Niehues et al., 2016; Chatterjee et al., 2016; Zhou et al., 2017; Xia et al., 2017; Junczys-Dowmunt and Grundkiewicz, 2017). Moreover, Zhang et al. (2018) introduce a backward decoder to better exploit the right-to-left target-side contexts. Generally, these methods employ two separate decoders to accomplish the polishing task.

Table 1
Reference: all appointments are for two years , except that of mr ho sai -chu 's which is for one year in order to tie in with the expiry date of his appointment as an ha member .
1st-pass: mr ho sai -chu 's UNK is a year -long term of two years with a term of two years as the term of his term of office of the ha .
2nd-pass: mr ho sai -chu 's UNK is a year -long term of two years with a term of two years to serve as the term of office of the ha .
3rd-pass: mr ho sai -chu 's UNK is a year -long term of two years with a term of two years to tie in with the expiry date of his term of office .
4th-pass: mr ho sai -chu has been serving as a member of authority for a term of two years with a term of two years .
Although these polishing mechanism-based approaches demonstrate their effectiveness with two-pass decoding, the idea of multi-pass decoding has not been well explored for NMT. Motivated by this, we first propose a novel multi-pass decoder that performs translation with a fixed number of decoding passes, referred to as the decoding depth. According to the preliminary results, as expected, multi-pass decoding benefits most translations. However, in some cases, more decoding passes lead to a poorer translation. For example, in Table 1, the 3rd-pass decoding achieves a better result than the 1st- and 2nd-pass decoding. Nevertheless, a drastic decrease arises when we perform the 4th-pass decoding. Therefore, it is necessary to introduce flexible multi-pass decoding, which can adaptively choose a suitable number of decoding passes.
Towards this goal, we further propose a novel framework called the adaptive multi-pass decoder, which automatically chooses a proper decoding depth using reinforcement learning. Our model treats multi-pass decoding as a sequential decision-making process, where either continuing decoding or halting is chosen at each step. An extra policy network learns, via reinforcement learning, to choose between continuing to the next decoding pass and halting. To make accurate and effective choices, the policy network employs a recurrent neural network to capture the complexity of the source sentence as well as the difference between consecutive generated translations. Extensive experiments on Chinese-English translation show that the proposed adaptive multi-pass decoder is capable of choosing a suitable decoding depth and significantly improves translation performance over the conventional NMT model.

Background
Given a source sentence $x = x_1, \ldots, x_m, \ldots, x_M$ and a target sentence $y = y_1, \ldots, y_n, \ldots, y_N$, end-to-end neural machine translation directly models the translation probability word by word as a single, large neural network:

$$P(y \mid x; \theta) = \prod_{n=1}^{N} P(y_n \mid x, y_{<n}; \theta)$$

where $\theta$ is a set of model parameters and $y_{<n}$ denotes a partial translation. Prediction of the $n$-th word is generally made in an encoder-decoder framework:

$$P(y_n \mid x, y_{<n}; \theta) = g(y_{n-1}, s_n, c_n)$$

where $g(\cdot)$ is a non-linear function, $y_{n-1}$ denotes the previously generated word, $s_n$ is the $n$-th decoding hidden state, and $c_n$ is a context vector for generating the $n$-th target word. The decoder state $s_n$ is computed by RNNs as follows:

$$s_n = f(s_{n-1}, y_{n-1}, c_n)$$

where $f(\cdot)$ is an activation function. In practice, gated RNN alternatives such as LSTM (Hochreiter and Schmidhuber, 1997) or GRU often achieve better performance than vanilla ones. $c_n$ is a dynamic vector that selectively summarizes certain parts of the source sentence at each decoding step:

$$c_n = \sum_{m=1}^{M} \alpha_{m,n} h_m$$

where $\alpha_{m,n}$ measures how well $x_m$ and $y_n$ are aligned, calculated by the attention model (Luong et al., 2015), and $h_m$ is the encoder hidden state of the $m$-th source word. To capture both forward and backward contexts, a bidirectional RNN (Schuster and Paliwal, 1997) is often employed as the encoder, which converts the source sentence into an annotation sequence $\{h_m\}$, where each $h_m$ captures information about the $m$-th word with respect to the preceding and following words in the source sentence, respectively.
Although the introduction of RNNs as a decoder has resulted in substantial improvements in translation quality, the structure of RNNs also imposes a serious restriction on the capability of the encoder-decoder framework. That is, when the RNN decoder generates the $t$-th word $y_t$ in the decoding phase, only $y_{<t}$ can be utilized, while the possible words $y_{>t}$ are neglected. Thus, it is difficult for the currently dominant RNN decoder to capture global information, especially about the words not yet generated. Under the premise of preserving the original structure, a promising way to address this issue is to extend the RNN decoder with auxiliary neural networks.
Towards this goal, polishing mechanism-based methods first capture the global information through a complete draft created by SMT or NMT, and then take it as input to generate the final translation. Compared with conventional NMT, polishing mechanism-based methods make a more accurate prediction at each time step thanks to the extra global understanding, resulting in more fluent and grammatically correct translations. While these approaches have demonstrated their effectiveness, they follow pre-defined routes to perform the decoding procedure, without considering how to choose a decoding depth suited to the complexity of the source sentence.
Therefore, it is important to develop a novel framework for making an accurate and effective choice about which decoding depth is appropriate for the source sentence.

Adaptive Multi-pass Decoder
In this section, we present an adaptive multi-pass decoder for neural machine translation, as illustrated in Figure 1. It can choose a proper decoding depth depending on the complexity of the source sentence. As shown in Figure 1, our model includes three major components: an encoder with parameter set $\theta_e$ to summarize source sentences, a multi-pass decoder with parameter set $\theta_d$ for multi-pass decoding, and a policy network with parameter set $\theta_p$ to choose a suitable depth. The encoder of our model is identical to that of the dominant NMT model, which uses a bidirectional RNN. Please refer to  for more details. We elaborate on the multi-pass decoder and the policy network for adaptive multi-pass decoding in the following subsections.

Multi-pass Decoder
The multi-pass decoder extends the decoder of the dominant NMT model to leverage the target-side context. Like the dominant NMT model, our multi-pass decoder performs decoding under the semantic guidance of the source-side context captured by the encoder. More importantly, and unlike that model, the global understanding provided by the target-side context of the last decoding pass strongly assists our model in producing a better translation. Given the source-side and target-side contexts captured by the encoder and the last decoding pass, respectively, the multi-pass decoder learns to generate the next target word based on the previously generated words. Using the multi-pass decoder with parameter set $\theta_d$, we calculate the conditional probability of the translation $\hat{y}^l$ at the $l$-th decoding pass as follows:

$$P(\hat{y}^l \mid x, \hat{y}^{l-1}; \theta_e, \theta_d) = \prod_{n=1}^{N_l} g_{dec}(\hat{y}^l_{n-1}, s^{l,dec}_n, c^{l,enc}_n, c^{l,dec}_n)$$

where $g_{dec}(\cdot)$ is a non-linear function, and $s^{l,dec}_n$ denotes the $n$-th decoding state within the $l$-th decoding pass. $N_l$ indicates the length of the translation generated at the $l$-th decoding pass. The decoding state $s^{l,dec}_n$ is obtained by RNNs as follows:

$$s^{l,dec}_n = f_{dec}(s^{l,dec}_{n-1}, \hat{y}^l_{n-1}, c^{l,enc}_n, c^{l,dec}_n)$$

where $f_{dec}(\cdot)$ is the GRU activation function. $c^{l,enc}_n$ and $c^{l,dec}_n$ denote the source-side and target-side contexts at the $n$-th time step within the $l$-th decoding pass, respectively. Note that when the multi-pass decoder performs the first decoding pass, no generated translation exists yet. To handle this case, the first-pass target-side context $c^{1,dec}$ is set to zero.
Among the aforementioned contexts, $c^{l,enc}_n$ is obtained as the weighted sum of the source-side hidden states $\{h_m\}$, while we take the target-side hidden states $\{s^{l-1,dec}_n\}$ produced by the last decoding pass as input to compute $c^{l,dec}_n$. Similar to the dominant NMT model, we adopt the attention model (Luong et al., 2015) to calculate the weights, which indicate the alignment probability. We let $attn_{enc}$ denote the encoder-decoder attention model, which takes the source annotations $\{h_m\}$ as input, while $attn_{dec}$ is introduced to calculate the weights that measure how well the decoding state $s^{l,dec}$ attends to the last-pass hidden states $\{s^{l-1,dec}_n\}$. Assuming $s^a$ indicates the decoding state, which attends to the annotations $\{s^b_k\}$ of length $K$, our attention model calculates the context vector $s^c$ as follows:

$$s^c = \sum_{k=1}^{K} \alpha_k s^b_k, \qquad \alpha_k = \frac{\exp(e_k)}{\sum_{k'=1}^{K} \exp(e_{k'})}, \qquad e_k = v_a^{\top} \tanh(W_a s^a + U_a s^b_k)$$

where $v_a$, $W_a$ and $U_a$ are the parameters of the attention model. Given a training pair $(x, y)$, the translation route can be written as:

$$x \rightarrow \hat{y}^1 \rightarrow \cdots \rightarrow \hat{y}^{L_{(x,y)}-1} \rightarrow y$$

where the intermediate translations $\hat{y}^1, \ldots, \hat{y}^{L_{(x,y)}-1}$ are generated by decoding. Given a training corpus $D = \{x, y\}$, we define the objective function as the cross-entropy at the last decoding pass:

$$J(\theta_e, \theta_d) = -\sum_{(x,y) \in D} \log P(y \mid x, \hat{y}^{L_{(x,y)}-1}; \theta_e, \theta_d)$$

where $L_{(x,y)}$ indicates the decoding depth for the instance $(x, y)$. For efficiency, all the intermediate translations $\{\hat{y}^l\}$ are generated by greedy search in both the training and testing phases.

Figure 1: The architecture of our adaptive multi-pass decoder. Given the annotation sequence produced by the encoder, a policy network is adopted to choose a suitable action from the set {Continue, Stop}, which indicates continuing to the next decoding pass or halting, respectively. Unlike the conventional decoder, which only obtains the source-side context with the source attention model $attn_{enc}$, our multi-pass decoder also captures the target-side context of the last decoding pass with the decoder attention model $attn_{dec}$. The policy network also uses $attn_{policy}$ to collect useful information from the multi-pass decoding in order to choose an accurate and effective action for generating a good translation. Note that in this work the same parameter set of the decoder and the corresponding attention is shared among different decoding passes. In this figure, we demonstrate a translation procedure with 3-pass decoding controlled by the adaptive multi-pass decoder.
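The attention computation described in this subsection can be illustrated with a minimal pure-Python sketch. It assumes the additive scoring form suggested by the parameters $v_a$, $W_a$ and $U_a$; the function names, the toy dimensions and the choice to return a zero vector of the same size as the query state when no previous pass exists are illustrative assumptions, not the authors' implementation.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def additive_attention(s_a, annotations, v_a, W_a, U_a):
    """Return (weights, context) for a query state s_a over annotations.

    Assumed scoring form: e_k = v_a . tanh(W_a s_a + U_a s_b_k),
    weights = softmax(e), context = sum_k weights[k] * s_b_k.
    """
    if not annotations:
        # First pass: there is no previous translation, so the target-side
        # context is set to zero (assumed here to have the size of s_a).
        return [], [0.0] * len(s_a)
    query = matvec(W_a, s_a)
    scores = []
    for s_b in annotations:
        key = matvec(U_a, s_b)
        hidden = [math.tanh(q + k) for q, k in zip(query, key)]
        scores.append(sum(v * h for v, h in zip(v_a, hidden)))
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(e - peak) for e in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(annotations[0])
    context = [sum(a * s_b[d] for a, s_b in zip(weights, annotations))
               for d in range(dim)]
    return weights, context

# Toy example with 2-dimensional states and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
last_pass_states = [[1.0, 0.0], [0.0, 1.0]]
weights, context = additive_attention([0.5, -0.5], last_pass_states,
                                      [1.0, 1.0], identity, identity)
_, first_pass_context = additive_attention([0.5, -0.5], [],
                                           [1.0, 1.0], identity, identity)
```

The same routine can stand in for both $attn_{enc}$ (annotations $\{h_m\}$) and $attn_{dec}$ (last-pass states), each with its own parameters.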

Policy Network
Multi-pass decoding can be cast as a sequential decision-making process, in which a policy chooses between performing the next decoding pass and halting. The policy is expected to automatically choose an accurate and effective decoding depth that yields a good translation. For example, if a good translation is hard to obtain for the source sentence, as with long sentences, we assume more decoding passes are needed to improve the translation, while a single decoding pass is enough for simple cases.
Our main idea is to use reinforcement learning to control the decoding depth. We parameterize the available action $a_l \in \{\text{Continue}, \text{Stop}\}$, where Continue and Stop indicate continuing to the next decoding pass and halting, respectively, by a policy network $\pi(a_l \mid s^{policy}_l; \theta_p)$, where $s^{policy}_l$ represents the policy state at the $l$-th decoding pass. To make a better choice about the decoding depth and direction, it is necessary to consider whether a good translation is easy to obtain for the source sentence and whether, compared with the last decoding pass, the quality of the translation can be improved. Following this guideline, the policy state $s^{policy}_l$ is calculated by a GRU to model the difference between two consecutive decoding passes as follows:

$$s^{policy}_l = f_{policy}(s^{policy}_{l-1}, m_l)$$

where $f_{policy}$ is the activation function, and $m_l$ captures the information useful to the policy network at the $l$-th decoding pass. In this work, we use the attention model $attn_{policy}$ to collect the decoding progress of the $l$-th decoding pass, denoted as $m_l$. To take account of the complexity of the source sentence itself, the initial policy state $s^{policy}_0$ is computed from $h_M$, the last state of the source annotations, with $W_{init}$ as the parameters for initializing the policy state. Finally, we take the policy state $s^{policy}_l$ as input to calculate the policy as follows:

$$\pi(a_l \mid s^{policy}_l; \theta_p) = \mathrm{softmax}(W_p s^{policy}_l + b_p) \quad (12)$$

where $W_p$ and $b_p$ are the parameters of the policy network. In this work we use the REINFORCE algorithm (Williams, 1992), which is an instance of a broader class of algorithms called policy gradient methods (Sutton and Barto, 1998), to learn the parameter set $\theta_p$ such that the sequence of actions $a = \{a_1, \ldots, a_l, \ldots, a_{L_{(x,y)}}\}$ maximizes the total expected reward.
The expected reward for an instance is defined as:

$$J^{policy}(\theta_p) = \mathbb{E}_{\pi(a \mid s^{policy}; \theta_p)} \, r(\hat{y}^{L_{(x,y)}}) \quad (13)$$

where $r(\hat{y}^{L_{(x,y)}})$ is the reward at the $L_{(x,y)}$-th decoding pass. In this work, the reward is computed from the BLEU score (Papineni et al., 2002) of the final translation $\hat{y}^{L_{(x,y)}}$ generated by greedy search.
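To make the policy of Eq. (12) and the REINFORCE update more concrete, the following pure-Python sketch computes the action probabilities and a single-sample gradient estimate of the expected reward in Eq. (13). It covers only the output layer ($W_p$, $b_p$) and treats the policy states as given inputs; the function names and the toy numbers are hypothetical, and the sketch is not the authors' training code.

```python
import math

ACTIONS = ("Continue", "Stop")

def policy_probs(state, W_p, b_p):
    """pi(a | s) = softmax(W_p s + b_p) over the two actions, as in Eq. (12)."""
    logits = [sum(w * x for w, x in zip(row, state)) + b
              for row, b in zip(W_p, b_p)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_gradients(states, actions, reward, W_p, b_p):
    """Single-sample REINFORCE estimate of the gradient of the expected reward.

    For a softmax-linear policy, d log pi(a|s) / d logits = onehot(a) - pi(.|s),
    so every step contributes reward * (onehot(a) - pi) to grad_b and the
    outer product of that vector with the state to grad_W.
    """
    grad_W = [[0.0] * len(row) for row in W_p]
    grad_b = [0.0] * len(b_p)
    for state, action in zip(states, actions):
        probs = policy_probs(state, W_p, b_p)
        for i in range(len(ACTIONS)):
            delta = reward * ((1.0 if i == action else 0.0) - probs[i])
            grad_b[i] += delta
            for j, x in enumerate(state):
                grad_W[i][j] += delta * x
    return grad_W, grad_b

# Toy episode: one decision (Stop, index 1) with a sentence-level reward of 0.5.
W_p = [[0.0, 0.0], [0.0, 0.0]]
b_p = [0.0, 0.0]
grad_W, grad_b = reinforce_gradients([[1.0, 2.0]], [1], 0.5, W_p, b_p)
```

Moving the parameters along these gradients raises the probability of the sampled actions in proportion to the reward, which is the basic mechanism by which the policy learns when to stop.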

Experiments
In this section, we describe experimental settings and report empirical results.

Setup
We evaluated the proposed adaptive multi-pass decoder on the Chinese-English translation task. The evaluation metric was case-insensitive BLEU (Papineni et al., 2002), computed with the script at https://github.com/mosessmt/mosesdecoder/blob/master/scripts/generic/multibleu.perl. The training corpus includes LDC2002E18, LDC2003E07, LDC2003E14, part of LDC2004T07, LDC2004T08 and LDC2005T06. To effectively train the NMT model, we trained each model with sentences of length up to 50 words. In addition, we limited the vocabulary size to 30K for both languages and mapped all the out-of-vocabulary words in the Chinese-English corpus to a special token UNK. We applied RMSprop (Graves, 2013) to train the models and selected the best model parameters according to the model performance on the development set. During this procedure, we set the following hyper-parameters: word embedding dimension 620, hidden layer size 1000, learning rate $5 \times 10^{-4}$, batch size 80, gradient norm 1.0, and dropout rate 0.3.
In the experiments, we compared our approach against the following state-of-the-art SMT and NMT systems:

1. Moses: an open source phrase-based translation system with default configuration and a 4-gram language model trained on the target portion of the training data. Note that we used all data to train Moses (Koehn et al., 2007).
2. RNNSearch: a variant of the attention-based NMT system with slight changes from the dl4mt tutorial.
3. Deliberation Network: a re-implementation of the attention-based NMT system with two independent left-to-right decoders (Xia et al., 2017). The first-pass decoder is identical to that of RNNSearch and generates a draft translation, while the second-pass decoder polishes it with an extra attention over the first-pass decoder. The second-pass decoder is integrated with the first-pass decoder via reinforcement learning.
4. ABDNMT: in contrast to the Deliberation Network, ABDNMT utilizes a first-pass backward decoder to generate a translation with greedy search, and the second-pass forward decoder refines it with an attention model (Zhang et al., 2018). For fairness, we replace the first-pass backward decoder with a forward decoder.
We set the beam size of all the above-mentioned models to 10. Deliberation Network and ABDNMT were initialized with the pretrained RNNSearch, as described by Xia et al. (2017) and Zhang et al. (2018). Our multi-pass decoder was also initialized with RNNSearch, and the other parameters were randomly initialized from a uniform distribution on [−0.1, 0.1]. In addition, for efficiency, we set the maximum decoding depth of our adaptive multi-pass decoder to 5.
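The interaction between the decoder, the policy and the maximum decoding depth of 5 can be sketched as a simple control loop. In the sketch below, `decode_pass` and `choose_action` are hypothetical stand-ins for the multi-pass decoder and the policy network, and querying the policy after each completed pass is an assumption about the control flow rather than a detail confirmed by the text.

```python
MAX_DEPTH = 5  # maximum decoding depth used in the experiments

def adaptive_decode(source, decode_pass, choose_action, max_depth=MAX_DEPTH):
    """Run decoding passes until the policy says Stop or max_depth is reached.

    decode_pass(source, previous) returns the translation of the next pass;
        `previous` is None on the first pass (no target-side context yet).
    choose_action(source, previous, current) returns "Continue" or "Stop".
    Both callables are hypothetical placeholders for the real networks.
    """
    previous = None
    history = []
    for depth in range(1, max_depth + 1):
        current = decode_pass(source, previous)
        history.append(current)
        if depth == max_depth:
            break
        if choose_action(source, previous, current) == "Stop":
            break
        previous = current
    return history[-1], len(history)

# Toy stand-ins: the "decoder" stops changing its output after the first
# pass, and the "policy" halts as soon as two consecutive passes agree.
def toy_decode(source, previous):
    return source.upper() if previous is None else previous

def stop_when_unchanged(source, previous, current):
    return "Stop" if previous == current else "Continue"

translation, depth = adaptive_decode("hello", toy_decode, stop_when_unchanged)
_, capped_depth = adaptive_decode("hello", toy_decode,
                                  lambda s, p, c: "Continue")
```

The stop-when-unchanged rule is only a toy heuristic for the example; in the proposed model the decision comes from the learned policy network.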

Results on Chinese-English Translation
The experimental results of our model and the baseline models on the Chinese-English machine translation datasets are depicted in Table 2.

Table 2: Evaluation on the NIST Chinese-English translation task. The BLEU scores are case-insensitive. "Params" denotes the number of parameters in each model. "Speed" denotes the generation speed in seconds on the development set. RNNSearch is an attention-based neural machine translation model with one-pass left-to-right decoding. RNNSearch(R2L) is a variant of RNNSearch with one-pass right-to-left decoding. As a comparison, Deliberation Network (Xia et al., 2017) and ABDNMT (Zhang et al., 2018) involve two independent decoders that adopt the polishing mechanism to extend the ability of conventional NMT. Deliberation Network utilizes two left-to-right decoders coupled with reinforcement learning, whereas ABDNMT exploits a backward decoder to perform first-pass right-to-left decoding. The {2,3,4,5}-pass decoders use our multi-pass decoder with a fixed number of decoding passes. Furthermore, the adaptive multi-pass decoder adds a policy network that enables our multi-pass decoder to choose a proper decoding depth.

Our proposed multi-pass decoder obtains much better performance with an increase of only 3.82M parameters over RNNSearch. Compared with Deliberation Network, which involves two-pass decoding, the multi-pass decoder gains at least 0.24 BLEU. Moreover, our multi-pass decoder achieves this with 37.35M fewer parameters than Deliberation Network. These results verify our hypothesis that more decoding passes can polish the generated output and improve the translation quality. The underlying reason is that the attention component $attn_{dec}$ within our multi-pass decoder captures the extra target-side contexts, providing a global understanding that assists the translation procedure.
Regarding the effect of the decoding depth in {2,3,4,5}, our multi-pass decoders obtain similar results, but the overall BLEU curve is on an upward trend. Specifically, the multi-pass decoder with decoding depth 5 achieves the best performance with 38.64 BLEU, while the one with decoding depth 3 performs the worst among these depths with 38.55 BLEU. Although the average results of the {2,3,4,5}-pass decoders are similar, their differences on NIST03, NIST04, NIST05, NIST06 and NIST08 are not negligible. These results indirectly support the need for a flexible mechanism.
Adaptive Decoding Depth. Our proposed adaptive multi-pass decoder involves an extra policy network which controls the decoding depth according to the complexity of the source sentence and the differences between consecutive generated translations. As shown in Table 2, the proposed adaptive multi-pass decoder obtains an average improvement of about 0.41 to 0.5 BLEU over the {2,3,4,5}-pass decoders, which demonstrates the effectiveness of the policy network. Specifically, the adaptive multi-pass decoder outperforms the multi-pass decoder with a fixed decoding depth by up to 0.69, 0.71, 0.68 and 0.45 BLEU on the NIST03, NIST04, NIST05 and NIST06 datasets, respectively. Compared with Moses, RNNSearch, Deliberation Network and ABDNMT, the adaptive multi-pass decoder improves by about 8.03, 1.55, 0.74 and 0.34 BLEU points, respectively. More importantly, our adaptive multi-pass decoder outperforms ABDNMT and Deliberation Network while using 26.85M and 29.15M fewer parameters, respectively.
To further demonstrate the effectiveness of adaptively choosing the decoding depth, we investigate the ratio of decoding passes consumed by our multi-pass decoder on the development dataset, as shown in Table 3. Our adaptive multi-pass decoder chooses one-pass decoding in a high ratio of 46.57% of cases, while in the remaining 53.43% of cases our model leverages more than one decoding pass to produce a translation. The average decoding depth of our model is calculated as: (1 × 46.57% + 2 × 20.45% + 3 × 13.00% + 4 × 13.60% + 5 × 6.38%) ≈ 2.13. Moreover, the ratio of samples tends to decrease on the whole as the decoding depth rises. Since time consumption correlates with decoding depth, our adaptive multi-pass decoder shows its advantage through fewer parameters and fewer decoding passes.
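As a quick check of the arithmetic, the average decoding depth can be recomputed as a ratio-weighted sum. The short sketch below uses the ratios listed in Table 3; the variable and function names are illustrative only.

```python
# Ratio (%) of development-set sentences decoded with each depth (Table 3).
ratios = {1: 46.57, 2: 20.45, 3: 13.00, 4: 13.60, 5: 6.38}

def average_depth(ratio_percent):
    """Expected decoding depth: sum of depth * share, normalised by the total."""
    total = sum(ratio_percent.values())
    weighted = sum(depth * share for depth, share in ratio_percent.items())
    return weighted / total

avg = average_depth(ratios)                # roughly 2.13 passes per sentence
multi_pass_share = 100.0 - ratios[1]       # share of sentences with depth > 1
```

With these ratios the weighted sum is 212.77 / 100, that is, about 2.13 passes per sentence, well below the maximum depth of 5.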
Table 3
Depth       1      2      3      4      5
Ratio (%)   46.57  20.45  13.00  13.60  6.38

Time Consumption. Due to the multi-pass decoding mechanism, the major limitation of our proposed multi-pass decoder is its time cost. In the training phase, we spend more time training the multi-pass decoder than RNNSearch, Deliberation Network and ABDNMT. In the testing phase, as illustrated in Table 2, our adaptive multi-pass decoder spends about 180s completing the entire testing procedure, in comparison with 87s, 162s and 132s for RNNSearch, Deliberation Network and ABDNMT, respectively, due to the auxiliary policy network. These results are consistent with the above conclusion drawn from the decoding depth. They therefore confirm the necessity of our proposed auxiliary policy network for choosing the decoding depth.

Figure 3: Ratio of decoding depths {1,2,3,4,5} chosen by our adaptive multi-pass decoder with respect to each length segment of the source sentences on the development dataset.

Effect of Source Sentence Length
Following prior work, we group sentences of similar lengths together and compute the BLEU score for each group, as shown in Figure 2. Our proposed adaptive multi-pass decoder outperforms RNNSearch in all length segments. Compared with the {2,3,4,5}-pass decoders, our adaptive multi-pass decoder outperforms most, if not all, of the multi-pass decoders with fixed decoding depth across the length segments.
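The grouping step can be sketched as follows. The segment width of 10 tokens is an assumption suggested by the segment [0, 10) mentioned in this section, whitespace tokenization is used for simplicity, and the function name is hypothetical.

```python
from collections import defaultdict

def group_by_length(sources, width=10):
    """Map each segment (low, low + width) to the indices of its sentences.

    Length is counted in whitespace-separated tokens; a sentence with n
    tokens falls into the half-open segment [low, low + width).
    """
    groups = defaultdict(list)
    for index, sentence in enumerate(sources):
        low = (len(sentence.split()) // width) * width
        groups[(low, low + width)].append(index)
    return dict(groups)

# Toy corpus with 3, 12, 25 and 2 tokens.
sources = ["a b c", "w " * 12, "x " * 25, "d e"]
groups = group_by_length(sources)
```

Each index list can then be used to select the hypotheses and references of one segment, so that a corpus-level BLEU score, or the ratio of chosen decoding depths, is computed per segment.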
To investigate the flexibility of the policy network, we calculate the ratios of decoding depths {1,2,3,4,5} for each group of sentences with similar length, as illustrated in Figure 3. The ratio of one-pass decoding remains high in every length segment and is clearly dominant in the segment [0, 10). In contrast, the ratios of the remaining decoding depths show upward trends on the whole. These results indirectly show that our policy network is capable of choosing a proper decoding depth. That is, when the source sentence is difficult to translate, as with some long sentences, more decoding passes are consumed to improve the translation quality, while in simple cases such as short source sentences, one-pass decoding is adequate.
Table 4: Translation example.
Reference: xinhua news agency, beijing, april 5 , bill gates , the all -famous microsoft chairman, was duped by a canadian radio station the the other day and fell a victim to a big prank on april fools ' day.
1st-pass: xihua news agency report of april 5 th from beijing (by staff reporter UNK UNK ) -the president of microsoft 's microsoft corporation , gates , was recently " UNK " by a radio station in canada and was hit by a UNK day on the day of the day .
2nd-pass: xinhua news agency, beijing, april 5, a fews days ago, microsoft 's president , microsoft corporation , was " UNK " by a radio station in canada .
3rd-pass: xinhua news agency, beijing, april 5 , microsoft 's president bill gates, the president of microsoft , was " UNK " by a radio station in canada in few days ago .
4th-pass: xinhua news agency, beijing, april 5 , microsoft 's president bill gates, the president of microsoft , was " UNK " by a radio station in canada in few days ago .
5th-pass: xinhua news agency, beijing, april 5 , microsoft 's president bill gates, the president of microsoft , was " UNK " by a radio station in canada in few days ago .

Our proposed adaptive multi-pass decoder has the ability to polish the generated hypothesis repeatedly. As shown in Table 4, we force our adaptive multi-pass decoder to perform multi-pass decoding with each fixed depth in {1,2,3,4,5}. The translation quality shows an upward trend from decoding depth 1 to 3, and decoding with depths 4 and 5 generates the same translation as decoding depth 3. Moreover, given the same source sentence, we use the proposed adaptive multi-pass decoder to choose the decoding depth. As expected, our adaptive multi-pass decoder chooses 3-pass decoding, which generates the best translation and consumes the least time, rather than {4,5}-pass decoding. These results therefore demonstrate the effectiveness of our adaptive multi-pass decoder.

Related Work
In this work, we mainly focus on how to adopt an adaptive polishing mechanism into the NMT model, a topic that has attracted intensive attention in recent years. We elaborate on polishing mechanism-based methods below. Polishing mechanism-based approaches first generate a complete draft, and then improve its quality based on a global understanding of the whole draft. A related line of work is post-editing (Niehues et al., 2016; Chatterjee et al., 2016; Zhou et al., 2017; Junczys-Dowmunt and Grundkiewicz, 2017): a source sentence e is first translated to f, and then f is refined by another model. Niehues et al. (2016) used phrase-based statistical machine translation (PBMT) to pre-translate the source sentence into the target language, and the result was taken as input to NMT to generate the final translation. Zhou et al. (2017) combined phrase-based statistical machine translation (PBMT), hierarchical phrase-based statistical machine translation (HPMT) and NMT in a unified architecture, similar to the dominant NMT model. Compared with the dominant NMT model, two attention models were involved in computing the context vectors. Specifically, one attention model calculates the context vector for each machine translation system, while the other attention model obtains the context vector over all the context vectors of these systems.
In the above works, generating and refining are two separate processes. In contrast, Xia et al. (2017) proposed the deliberation network, which consists of two decoders: a first-pass decoder generates a draft, which is taken as input by the second-pass decoder to obtain a better translation. All the components of the deliberation network are coupled together and jointly optimized in an end-to-end way via reinforcement learning. Instead of a first-pass forward decoder, Zhang et al. (2018) adopted a backward decoder to capture the right-to-left target-side contexts, which are taken as input to assist the second-pass forward decoder in obtaining a better translation. Another difference from the deliberation network is that the second-pass decoder is integrated with the first-pass decoder without reinforcement learning.
To explore the polishing mechanism, our model adopts an adaptive multi-pass decoding strategy. Compared with previous works, which consume no more than two decoding passes, our multi-pass decoder attempts to perform decoding with more passes. More importantly, we adopt an adaptive decoding depth controlled by a policy network to extend the capacity of our multi-pass decoder.

Conclusion
In this paper, we propose a novel architecture called the adaptive multi-pass decoder to adopt the polishing mechanism into the NMT model via reinforcement learning. Towards this goal, a novel multi-pass decoder is introduced to generate the translation, conditioned on the source- and target-side contexts. At the same time, the multi-pass decoding is supervised by a policy network, which learns to choose, at each step, between continuing to the next decoding pass and halting, so as to maximize the BLEU of the final translation. As a result, our model is capable of controlling the decoding depth to generate a better translation. Extensive experiments on Chinese-English translation demonstrate the effectiveness of the proposed adaptive multi-pass decoder.
In this paper, we focus on utilizing a multi-pass decoder to polish the translation. Our proposed multi-pass decoder performs multi-pass decoding with forward decoding only. One promising direction is to incorporate backward decoding into our architecture. More specifically, we can extend the policy network to choose backward decoding in addition to forward decoding and halting.