Smart-Start Decoding for Neural Machine Translation

Most current neural machine translation models adopt a monotonic decoding order, either left-to-right or right-to-left. In this work, we propose a novel method, called Smart-Start decoding, that breaks the limitation of these monotonic decoding orders. More specifically, our method first predicts a median word; it then decodes the words on the right side of the median word, and finally generates the words on the left. We evaluate the proposed Smart-Start decoding method on three datasets. Experimental results show that the proposed method significantly outperforms strong baseline models.


Introduction
Neural machine translation (NMT) has made remarkable progress in recent years. There has been much progress in the encoder-decoder framework, including recurrent neural models (Wu et al., 2016), convolutional models (Gehring et al., 2017), and self-attention models (Vaswani et al., 2017). In particular, the Transformer, which relies only on self-attention networks, has achieved state-of-the-art performance on different benchmarks.
Most encoder-decoder frameworks generate the target translation in a completely monotonic order, either from left to right (L2R) or from right to left (R2L). However, monotonic generation is not always the best translation order for the machine translation task. As shown in Figure 1, "乐 (happy)" needs to leverage the future context "开朗 (lively)" to disambiguate the translation in the English sentence, because "乐" has two meanings: "happy to do something" and "Le (a person name)". In this example, the L2R baseline model produced the incorrect translation "Le (person name)" because the future context was still unseen. There are several related works on non-monotonic text generation (Mehri and Sigal, 2018; Welleck et al., 2019; Gu et al., 2019; Zhou et al., 2019b,a). Inspired by these works, we consider choosing a proper position to start decoding, instead of following a strict L2R or R2L order. We propose a novel method called Smart-Start decoding. Specifically, our method starts the generation of target words from the right part of the sentence, "Yang Sen has a lively personality .", followed by the generation of the left part, "Chatting with people ,". The intuition is that humans do not always translate a sentence from the first word to the last; instead, they may translate different parts of the sentence before organizing the whole translation.

* Contribution during internship at Microsoft Research Asia. † Corresponding author.
As shown in Figure 1, our Smart-Start method predicts the word "Yang" in the median position of the target sentence, followed by the remaining words of the right part of the sentence, "Yang Sen has a lively personality .". Once our model produces the special symbol "[m]", which is designed to indicate the termination of the right-part generation, we start predicting the left part of the sentence, "Chatting with people ,". Finally, we obtain the final translation from the intermediate translation by simply moving the left part "Chatting with people ," in front of the right part "Yang Sen has a lively personality ." and removing the additional symbol "[m]".
We introduce a weighted maximum likelihood algorithm to automatically learn this kind of decoding order by giving weights to translations with different start positions.
To verify the effectiveness of our method, we conduct experiments on three benchmarks: the IWSLT14 German-English, WMT14 English-German, and LDC Chinese-English translation tasks. Experimental results show that our method outperforms both monotonic and non-monotonic baselines. In conclusion, we propose a simple but effective method that first predicts the words from a median position to the last position of the sentence, and then predicts the words of the left part.

Smart-Start Machine Translation
In this section, we present the details of the proposed hard and soft Smart-Start methods. Our method first predicts a median word, then predicts the words on its right, and finally generates the words on the left.

Method
Our method is split into two phases. First, given the source sentence X = (x_1, x_2, ..., x_m), we use the model P_θ(Z_k | X) to predict the intermediate translation Z_k starting from a middle position of the sentence, where Z_k = (y_{n-k+1}, ..., y_n, [m], y_1, ..., y_{n-k}) and "[m]" is the special symbol marking the boundary between the two parts. Second, we construct the final translation Y from the intermediate translation Z_k. As shown in Figure 2, our method first predicts the word y_{n-k+1} given the source sentence. Then the model predicts the right part of the sentence (y_{n-k+1}, ..., y_n) word by word. When it predicts the symbol "[m]", it starts predicting the left part of the sentence (y_1, ..., y_{n-k}). Finally, we obtain the final translation Y from the intermediate translation Z_k. Our method is based on the Transformer architecture.
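The mapping between the final translation Y and the intermediate translation Z_k can be sketched as follows (a minimal illustration, not the authors' code; the function names are ours):

```python
MARKER = "[m]"  # special symbol separating the right part from the left part

def to_intermediate(y, k):
    """Build Z_k = (y_{n-k+1}, ..., y_n, [m], y_1, ..., y_{n-k}):
    the last k target words, the marker, then the remaining left part."""
    n = len(y)
    return y[n - k:] + [MARKER] + y[:n - k]

def to_final(z):
    """Recover the final translation Y from an intermediate translation:
    the words after [m] form the left part of Y, the words before it
    form the right part."""
    i = z.index(MARKER)
    return z[i + 1:] + z[:i]
```

For the running example, `to_final` turns the decoded sequence "Yang Sen has a lively personality . [m] Chatting with people ," back into "Chatting with people , Yang Sen has a lively personality .".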

Smart-Start Decoding
Our Smart-Start method aims to break the limitation of a fixed monotonic decoding order. Different from the traditional L2R and R2L orders (Sennrich et al., 2016a), our Smart-Start method first predicts the median word y_{n-k+1} given the source sentence. It then sequentially predicts the right part of the target sentence (y_{n-k+1}, ..., y_n), i.e., the words to the right of the median word. Finally, it generates the remaining words (y_1, ..., y_{n-k}) of the left part, given the source sentence and the already generated right part. Formally, we build our Smart-Start neural machine translation model as below:

P_θ(Z_k | X) = ∏_{i=n-k+1}^{n} P_θ(y_i | y_{n-k+1}, ..., y_{i-1}, X) · P_θ([m] | y_{n-k+1}, ..., y_n, X) · ∏_{j=1}^{n-k} P_θ(y_j | y_{n-k+1}, ..., y_n, [m], y_1, ..., y_{j-1}, X),

where i and j index the i-th and j-th words of the target sentence, and [m] is the boundary symbol of Z_k that immediately follows the k words of the right part.

Smart-Start Training
Since there is no annotation of initial words to start the decoding, we construct the intermediate sentences with different start positions and then score them with hard or soft Smart-Start methods.
Therefore, given a source sentence X of length m and a target sentence Y of length n, we can construct n intermediate sentences {Z_1, Z_2, ..., Z_n}, one per start position. Because the target length n can be large, we randomly sample S intermediate sentences from these n candidates to construct the subset S_Y, where S is the number of sampled start positions. We apply scores calculated by the hard or soft Smart-Start method to the losses of the different intermediate samples, teaching the model which start position is better. This procedure can be described by the weighted log-likelihood (WML) (Dimitroff et al., 2013) objective L over the dataset D:

L(θ) = Σ_{(X,Y)∈D} Σ_{Z_k ∈ S_Y} w_k log P_θ(Z_k | X),

where S_Y is the subset containing S samples and w_k is calculated by the hard or soft Smart-Start method.
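The sampling and weighting steps above can be sketched as follows (illustrative code of ours, not the paper's implementation; the weights are assumed to come from the hard or soft scoring step):

```python
import random

def sample_start_positions(n, S, rng=random):
    """Sample S distinct start positions out of the n possible ones
    (one intermediate sentence Z_k per sampled position k)."""
    return rng.sample(range(1, n + 1), min(S, n))

def wml_loss(log_probs, weights):
    """Weighted maximum-likelihood objective for one sentence pair:
    the negative of sum_k w_k * log P_theta(Z_k | X) over sampled Z_k."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))
```

Minimizing `wml_loss` pushes probability mass toward the intermediate sentences that the scoring step marked as good start positions.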
For the hard Smart-Start method, we use the median training loss of the intermediate samples as a threshold to select appropriate samples for updating the model parameters. We calculate w_k by comparing the training loss of each Z_k in S_Y under the current model with this threshold:

w_k = 1 if ℓ(Z_k^trans) ≤ median{ℓ(Z_{k'}^trans) : Z_{k'} ∈ S_Y}, and w_k = 0 otherwise,

where Z_k^trans is the intermediate translation generated by the current training model P_θ(Z_k | X) using teacher forcing, ℓ(·) denotes its training loss, and Z_k is the intermediate sentence from S_Y.
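A minimal sketch of this hard weighting rule, assuming `losses` holds the teacher-forced training loss of each sampled intermediate sentence (the choice of the upper median for even sample counts is ours; the paper only specifies a median threshold):

```python
def hard_weights(losses):
    """Hard Smart-Start: keep samples whose loss is at most the median
    loss of the sampled set (w_k = 1), discard the rest (w_k = 0)."""
    threshold = sorted(losses)[len(losses) // 2]  # (upper) median
    return [1.0 if l <= threshold else 0.0 for l in losses]
```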

Experiments
In this section, we evaluate our method on three popular benchmarks.

Dataset
The IWSLT14 De-En corpus contains 160K training sentence pairs; the valid and test sets each contain 7K sentence pairs. The LDC Zh-En training data contains 1.4M sentence pairs; NIST 2006 is used as the valid set, and NIST 2002, 2003, 2005, 2008, and 2012 are used as test sets. The WMT14 En-De corpus has 4.5M sentence pairs; newstest2013 and newstest2014 are used as the valid and test sets, respectively. All languages are tokenized by Moses (Koehn et al., 2007) or our Chinese tokenizer, and then encoded using byte pair encoding (BPE) (Sennrich et al., 2016b) with 40K merge operations. The evaluation metric is BLEU (Papineni et al., 2002).

Training Details
We conduct experiments on 8 NVIDIA 32GB V100 GPUs and set the batch size to 1024 tokens. In the training stage, we adopt the Adam optimizer. For the LDC Zh→En translation task, we use the Transformer_base setting with an embedding size of 512 and a feed-forward network (FFN) size of 2048. For the IWSLT14 De→En translation task, we use the Transformer_small setting with an embedding size of 512 and an FFN size of 1024; the dropout is set to 0.3 and the weight decay to 0.0001 to prevent overfitting. For the WMT14 En→De translation task, we use the Transformer_big setting with an embedding size of 1024 and an FFN size of 4096. Following previous work (Ott et al., 2018), we accumulate the gradient for 16 iterations to simulate a 128-GPU environment.
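The gradient-accumulation trick can be sketched independently of any framework (our own illustrative code): gradients of 16 consecutive micro-batches are averaged before a single optimizer step, so 8 GPUs see the same effective batch as 128.

```python
def accumulate_and_step(micro_batch_grads, accum_steps, apply_update):
    """Sum gradients over accum_steps micro-batches, then call
    apply_update once with the averaged gradient (one optimizer step)."""
    buffer = None
    for i, grad in enumerate(micro_batch_grads, start=1):
        buffer = grad if buffer is None else [a + g for a, g in zip(buffer, grad)]
        if i % accum_steps == 0:
            apply_update([a / accum_steps for a in buffer])
            buffer = None
```

In practice this is what toolkits such as fairseq expose as an update-frequency option rather than hand-rolled code.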

Baselines and Results
We compare our method with other baselines, including the Transformer (Vaswani et al., 2017), the RP Transformer (Shaw et al., 2018), LightConv/DynamicConv, and SB-NMT (Zhou et al., 2019a). For the IWSLT14 De→En results in Table 2 and the LDC Zh→En results in Table 1, our soft method obtains a significant improvement of +0.98/+1.71 BLEU points over a strong Transformer model. For the WMT14 En→De task, the results of our model are presented in Table 3. We also compare our method with other self-attention models. The SB-NMT model, which decodes from L2R and R2L simultaneously and interactively, achieves 29.21 BLEU points. Our method achieves an improvement of +0.56 BLEU points over the Transformer baseline, and our soft Smart-Start method outperforms the SB-NMT model by +0.80 BLEU points.

Table 1: Case-insensitive evaluation results on the LDC Zh→En translation task with BLEU-4 scores (%). The "Avg" column is the averaged result over all NIST test sets. All baselines are re-implemented by ourselves.

Discussions and Analysis
Number of Sampled Start Positions To explore the effect of the number of sampled start positions S described in Equation 2, we conduct experiments on the IWSLT14 De→En translation task. Figure 3 shows that both the hard and soft Smart-Start methods gradually improve as the value of S increases, and the soft method outperforms the hard method under all settings. The soft method achieves its highest BLEU score when the number of sampled start positions equals 7; we recommend using a value in the interval 4 ≤ S ≤ 12. In conclusion, the soft Smart-Start method brings a larger improvement in BLEU scores.

Distribution of Start Positions Figure 4 shows that positions other than the first also occupy a certain proportion of the learned start positions. Therefore, the conventional left-to-right decoding order is not always the best decoding order, and starting from other positions is beneficial for translation quality, which verifies our motivation.

Linguistic Analysis Based on Figure 4, we further conduct a linguistic analysis. The three plots show that [m] tends to occur at the earliest position (k = 1), where the intermediate translation is Z_1 = (y_n, [m], y_1, ..., y_{n-1}). We observe that y_n is mostly punctuation in this situation, such as a period, question mark, or exclamation mark. Conjunctions and prepositions such as "or" and "but" are also inclined to appear at the beginning of sentences, which indicates that clauses are easier to place at the beginning. This is consistent with our intuition that punctuation marks are the easiest to predict first.

Training Time The Transformer baseline costs nearly 0.9 hours and our method costs nearly 1.8 hours (about 2× slower) on the IWSLT14 De→En translation task, where both experiments are conducted in the 8-V100-GPU environment with 1024 max tokens. Our method does not require many additional training steps to converge compared with the Transformer baseline, and it outperforms the baseline by +0.8 BLEU points. Another factor affecting the training time is the number of sampled start positions. As investigated above, a smaller value such as 4 or 6 also brings significant improvements in practice. Therefore, we choose a smaller number of sampled start positions and use multiple GPUs to keep the training time within a reasonable range.

Conclusion
In this work, we propose a novel method, called Smart-Start decoding, that breaks the limitation of monotonic decoding orders. Our method predicts a median word, then generates the words on its right, and finally generates the words on the left. Experimental results show that our Smart-Start method significantly improves translation quality.