Learning to Recover from Multi-Modality Errors for Non-Autoregressive Neural Machine Translation

Non-autoregressive neural machine translation (NAT) predicts the entire target sequence simultaneously and significantly accelerates the inference process. However, NAT discards the dependency information within a sentence and thus inevitably suffers from the multi-modality problem: different target tokens may be generated with respect to different feasible translations, often causing repetitive or missing tokens. To alleviate this problem, we propose a novel semi-autoregressive model, RecoverSAT, which generates a translation as a sequence of segments. The segments are generated simultaneously while each segment is predicted token-by-token. By dynamically determining segment length and deleting repetitive segments, RecoverSAT is capable of recovering from repetitive and missing token errors. Experimental results on three widely-used benchmark datasets show that our proposed model achieves more than a 4× speedup while maintaining comparable performance with the corresponding autoregressive model.


Introduction
Although neural machine translation (NMT) has achieved state-of-the-art performance in recent years (Cho et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017), most NMT models still suffer from slow decoding due to their autoregressive property: the generation of a target token depends on all previously generated target tokens, making the decoding process intrinsically non-parallelizable.
Recently, non-autoregressive neural machine translation (NAT) models (Gu et al., 2018; Guo et al., 2019a; Wei et al., 2019) have been proposed to address the slow decoding speed problem by generating all target tokens independently in parallel, speeding up the decoding process significantly. Unfortunately, these models suffer from the multi-modality problem (Gu et al., 2018), resulting in inferior translation quality compared with autoregressive NMT.
To be specific, a source sentence may have multiple feasible translations, and each target token may be generated with respect to a different feasible translation since NAT models discard the dependency among target tokens. This generally manifests as repetitive or missing tokens in the translations. Table 1 shows an example. The German phrase "viele Farmer" can be translated as either "lots of farmers" or "a lot of farmers". In the first translation (Trans. 1), "lots of" is translated w.r.t. "lots of farmers" while "of farmers" is translated w.r.t. "a lot of farmers", such that two "of" are generated. Similarly, "of" is missing in the second translation (Trans. 2). Intuitively, the multi-modality problem has a significant negative effect on the translation quality of NAT.

[Figure 1: The segments are generated simultaneously while each segment is generated token-by-token conditioned on both the source tokens and the translation history of all segments (e.g., the token "are" in the first segment is predicted based on all the tokens colored green). Repetitive segments (e.g., the third segment "lots of") are detected and deleted automatically.]

Intensive efforts have been devoted to alleviating the above problem, and they can be roughly divided into two lines of work. The first leverages the iterative decoding framework to break the independence assumption: it first generates an initial translation and then refines the translation iteratively, taking both the source sentence and the translation of the last iteration as input (Lee et al., 2018; Ghazvininejad et al., 2019). Nevertheless, it requires refining the translations multiple times to achieve good translation quality, which hurts decoding speed significantly. The other line of work tries to improve the vanilla NAT model to better capture target-side dependency by leveraging extra autoregressive layers in the decoder (Shao et al., 2019a; Wang et al., 2018), introducing latent variables and/or more powerful probabilistic frameworks to model more complex distributions (Kaiser et al., 2018; Akoury et al., 2019; Shu et al., 2019; Ma et al., 2019), guiding the training process with an autoregressive model (Wei et al., 2019), etc. However, these models cannot alter a target token once it has been generated, which means they are unable to recover from errors caused by the multi-modality problem.
To alleviate the multi-modality problem while maintaining a reasonable decoding speedup, we propose a novel semi-autoregressive model named RecoverSAT in this work. RecoverSAT features three key designs: (1) To improve decoding speed, we assume that a translation can be divided into several segments which can be generated simultaneously.
(2) To better capture target-side dependency, the tokens inside a segment are generated autoregressively, conditioned not only on the previously generated tokens in this segment but also on those in other segments. On the one hand, we observe that repetitive tokens are more likely to occur within a short context; therefore, generating a segment autoregressively is beneficial for reducing repetitive tokens. On the other hand, by conditioning on previously generated tokens in other segments, the model is capable of guessing which feasible translation candidate has been chosen by each segment and adapting accordingly, e.g., recovering from missing token errors. As a result, our model captures more target-side dependency such that the multi-modality problem can be alleviated naturally. (3) To make the model capable of recovering from repetitive token errors, we introduce a segment deletion mechanism into our model. Informally speaking, our model marks a segment for deletion once it finds that its content has already been translated in other segments.
We conduct experiments on three benchmark machine translation datasets to evaluate the proposed method. The experimental results show that RecoverSAT is able to decode over 4× faster than its autoregressive counterpart while maintaining comparable performance. The source code of this work is released at https://github.com/ranqiu92/RecoverSAT.

Autoregressive Neural Machine Translation
Autoregressive neural machine translation (AT) generates the translation token-by-token conditioned on the translation history. Denoting a source sentence as $x = \{x_i\}_{i=1}^{T}$ and a target sentence as $y = \{y_j\}_{j=1}^{T'}$, AT models the joint probability as

$P(y|x) = \prod_{t=1}^{T'} P(y_t \mid y_{<t}, x),$

where $y_{<t}$ denotes the tokens generated before $y_t$.
During decoding, the dependency on translation history forces the AT model to predict each token only after all previous tokens have been generated, making the decoding process time-consuming.

Non-Autoregressive Neural Machine Translation
Non-autoregressive neural machine translation (NAT) (Gu et al., 2018) aims to accelerate the decoding process. It discards the dependency on the translation history and models $P(y|x)$ as a product of conditionally independent per-token probabilities:

$P(y|x) = \prod_{t=1}^{T'} P(y_t \mid x).$

The conditional independence enables NAT models to generate all target tokens in parallel. However, independently predicting all target tokens is challenging, as natural language often exhibits strong correlations across context. Since the model knows little about the surrounding target tokens, it may consider different feasible translations when predicting different target tokens. This problem is known as the multi-modality problem (Gu et al., 2018) and significantly degrades the performance of NAT models.
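The contrast between the two factorizations can be made concrete with a toy sketch (our own example with made-up translation modes and probabilities, not the paper's model): NAT argmaxes each position independently over the marginal $P(y_t|x)$, which can mix tokens from different feasible translations, while AT renormalizes over modes consistent with the generated prefix and stays within one mode.

```python
# Toy illustration of the multi-modality problem. "thank you" has several
# feasible German translations ("modes"); probabilities are invented.
modes = {
    ("danke", "schoen"): 0.35,
    ("vielen", "dank"): 0.40,
    ("danke", "sehr"): 0.25,
}

def nat_greedy():
    """NAT-style decoding: independent per-position argmax over marginals."""
    out = []
    for t in range(2):
        scores = {}
        for sent, p in modes.items():
            scores[sent[t]] = scores.get(sent[t], 0.0) + p
        out.append(max(scores, key=scores.get))
    return out

def at_greedy():
    """AT-style decoding: only modes matching the prefix contribute."""
    prefix = []
    for t in range(2):
        scores = {}
        for sent, p in modes.items():
            if sent[:t] == tuple(prefix):
                scores[sent[t]] = scores.get(sent[t], 0.0) + p
        prefix.append(max(scores, key=scores.get))
    return prefix
```

Here `nat_greedy()` yields "danke dank", which mixes two modes and is not a feasible translation, while `at_greedy()` yields the mode-consistent "danke schoen".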

Overview
RecoverSAT extends the original Transformer (Vaswani et al., 2017) so that the decoder generates autoregressively at the local (within-segment) level and non-autoregressively at the global (across-segment) level. An overview of the architecture of our RecoverSAT model is shown in Figure 1. As illustrated in the figure, RecoverSAT simultaneously predicts all segments "there are EOS", "lots of farmers EOS", "a lot DEL" and "doing this today EOS". At each time step, it generates a token for each incomplete segment. The special token DEL denotes that the segment should be deleted and EOS denotes the end of a segment. Combining all the segments, we obtain the final translation "there are lots of farmers doing this today".
Formally, assume a translation $y$ is generated as $K$ segments $S^1, S^2, \cdots, S^K$, where $S^i$ is a subsequence of the translation. For simplicity of description, we assume that all segments have the same length. RecoverSAT predicts one token for each segment conditioned on all previously generated tokens at each generation step, which can be formulated as

$P(y|x) = \prod_{t=1}^{L} \prod_{i=1}^{K} P(S^i_t \mid S^1_{<t}, \cdots, S^K_{<t}, x),$  (3)

where $S^i_t$ denotes the $t$-th token in the $i$-th segment, $S^i_{<t} = \{S^i_1, \cdots, S^i_{t-1}\}$ denotes the translation history of the $i$-th segment, and $L$ is the segment length.
Here, two natural problems arise for the decoding process: • How to determine the length of a segment?
• How to decide a segment should be deleted?
We address the two problems in a unified way in this work. Suppose the original token vocabulary is $V$; we extend it with two extra tokens EOS and DEL. Then for the segment $S^i$, the most probable token $\hat{S}^i_t$ at time step $t$,

$\hat{S}^i_t = \arg\max_{w \in V \cup \{\text{EOS}, \text{DEL}\}} P(S^i_t = w \mid S^1_{<t}, \cdots, S^K_{<t}, x),$

has three possibilities: (1) $\hat{S}^i_t \in V$: the segment $S^i$ is incomplete and its decoding process should continue; (2) $\hat{S}^i_t = \text{EOS}$: the segment $S^i$ is complete and its decoding process should terminate; (3) $\hat{S}^i_t = \text{DEL}$: the segment $S^i$ is repetitive and should be deleted; accordingly, its decoding process should terminate.
The entire decoding process terminates when every segment has emitted EOS or DEL, or when the maximum token number is reached. Note that we do not explicitly delete a segment the moment DEL is encountered, but instead do so via post-processing; in other words, the model is trained to implicitly ignore segments marked for deletion.
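The decoding procedure described above can be sketched as follows. The `predict` argument is a hypothetical stand-in for the trained decoder (it returns one candidate next token per segment, conditioned on the history of all segments); the scripted predictions merely replay the running example and are our own illustration.

```python
EOS, DEL = "<eos>", "<del>"

def recoversat_decode(predict, num_segments, max_steps=50):
    """Sketch of the RecoverSAT decoding loop: at each step, every
    incomplete segment receives one token; EOS/DEL freeze a segment;
    DEL-marked segments are dropped during post-processing."""
    segments = [[] for _ in range(num_segments)]
    done = [False] * num_segments
    for _ in range(max_steps):
        if all(done):
            break
        next_tokens = predict(segments)  # conditioned on ALL segments' history
        for i, tok in enumerate(next_tokens):
            if done[i]:
                continue  # completed segments emit nothing further
            segments[i].append(tok)
            if tok in (EOS, DEL):
                done[i] = True
    # Post-processing: delete DEL-marked segments, strip EOS, concatenate.
    output = []
    for seg in segments:
        if seg and seg[-1] == DEL:
            continue
        output.extend(t for t in seg if t != EOS)
    return output

# Scripted predictions replaying the example from Figure 1
# (a real model would compute these from the source sentence).
plan = [["there", "are", EOS],
        ["lots", "of", "farmers", EOS],
        ["a", "lot", DEL],
        ["doing", "this", "today", EOS]]

def scripted_predict(segments):
    return [plan[i][len(s)] if len(s) < len(plan[i]) else EOS
            for i, s in enumerate(segments)]
```

Running `recoversat_decode(scripted_predict, 4)` reproduces the final translation "there are lots of farmers doing this today", with the third segment deleted.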

Learning to Recover from Errors
As there is little target-side information available in the early stage of the decoding process, errors caused by the multi-modality problem are inevitable. In this work, instead of trying to reduce such errors directly, we propose two training mechanisms that teach our RecoverSAT model to recover from errors: (1) Dynamic Termination Mechanism: learning to determine segment length according to target-side context; (2) Segment Deletion Mechanism: learning to delete repetitive segments. (The segment number K is determined dynamically according to the sentence length; during inference, we can predict the target sentence length to determine the segment number, in which case our model also decodes in constant time.)

Dynamic Termination Mechanism
As shown in Section 3.1, instead of pre-specifying the lengths of segments, we let the model determine the lengths by emitting the EOS token. This strategy helps our model recover from multi-modality related errors in two ways: 1. The choice of the first few tokens is more flexible. Taking Figure 1 as an example, if the decoder decides the first token of the second segment is "of" instead of "lots" (i.e., "lots" is not generated in the second segment), it only needs to generate "lots" before "EOS" in the first segment in order to recover from missing token errors. In contrast, if the decoder decides the first token is "are", it can avoid repetitive token error by not generating "are" in the first segment; 2. As shown in Eq. 3, a token is generated conditioned on all the previously generated tokens in all the segments. Therefore, the decoder has richer target-side information to detect and recover from such errors.
However, it is non-trivial to train the model to learn such behaviour while maintaining a reasonable speedup. On one hand, as the decoding time of our RecoverSAT model is proportional to the maximum length of the segments, we should divide the target sentences of training instances into equal-length segments to encourage the model to generate segments with identical length. On the other hand, the model should be exposed to the multi-modality related errors to enhance its ability of recovering from such errors, which suggests that the target sentences of training instances should be divided randomly to simulate these errors.
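This is why equal-length segments matter for speed: since all K segments advance one token per step, the number of sequential decoding steps equals the length of the longest segment. A toy step-counting sketch (our own illustration; it ignores per-step decoder cost):

```python
def at_steps(target_len):
    """AT decoding emits one token per step: T' sequential steps."""
    return target_len

def recoversat_steps(target_len, num_segments):
    """All K segments advance one token per step, so the sequential step
    count equals the longest segment's length; for an equal split this is
    ceil(T' / K). Uneven splits only increase the step count."""
    return -(-target_len // num_segments)  # ceil division
```

For a 20-token translation with K = 4 equal segments, 5 sequential steps suffice instead of 20.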
To alleviate this problem, we propose a mixed annealing dividing strategy. To be specific, at each training step we randomly decide whether to divide a target sentence equally or randomly, and we gradually anneal toward the equal dividing method by the end of training. Formally, given the target sentence $y$ and the segment number $K$, we define the segment dividing index set $r$ as follows:

$s \sim \text{Bernoulli}(p),$
$r = \begin{cases} \text{EQUAL}(T', K-1) & s = 0 \\ \text{RAND}(T', K-1) & s = 1 \end{cases}$  (5)

where $\text{Bernoulli}(p)$ is the Bernoulli distribution with parameter $p$, $T'$ is the target sentence length, $\text{EQUAL}(n, m) = \{\lfloor \frac{n}{m+1} \rfloor, \lfloor \frac{2n}{m+1} \rfloor, \cdots, \lfloor \frac{mn}{m+1} \rfloor\}$, and $\text{RAND}(n, m)$ samples $m$ non-duplicate indices from $[1, n]$. A larger value of $p$ leads to better error-recovering ability, while a smaller one encourages the model to generate segments with similar lengths (in other words, better speedup). To balance the two aspects, we gradually anneal $p$ from 1 to 0 during training, which achieves better performance (Section 4.5).
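A minimal sketch of the dividing strategy (our own simplification with our own function names; a real implementation would operate on batched subword index tensors):

```python
import random

def equal_divide(n, m):
    """EQUAL(n, m): m cut indices splitting a length-n sentence evenly."""
    return [k * n // (m + 1) for k in range(1, m + 1)]

def rand_divide(n, m):
    """RAND(n, m): m distinct cut indices sampled from the interior."""
    return sorted(random.sample(range(1, n), m))

def divide_target(tokens, num_segments, p):
    """Mixed annealing dividing: with probability p cut randomly (exposing
    the model to uneven, error-like segments), otherwise cut evenly
    (encouraging equal-length segments and thus better speedup).
    During training, p is annealed from 1 to 0."""
    m = num_segments - 1
    cuts = rand_divide(len(tokens), m) if random.random() < p else equal_divide(len(tokens), m)
    bounds = [0] + cuts + [len(tokens)]
    return [tokens[a:b] for a, b in zip(bounds, bounds[1:])]
```

With p = 0 every sentence is split evenly; with p = 1 the cuts are sampled uniformly, simulating the uneven segments the model must learn to recover from.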

Segment Deletion Mechanism
Although the dynamic termination mechanism makes the model capable of recovering from missing token errors and reducing repetitive tokens, the model still cannot recover once token repetition errors have already occurred. We find that most errors of our model occur when generating the first token of each segment, since at that point the model can see neither history nor future for that segment. In this situation, two repetitive segments may be generated. To alleviate this problem, we propose a segment-wise deletion strategy, which uses a special token DEL to indicate that a segment is repetitive and should be deleted.
A straightforward way to train the model to delete a segment is to inject pseudo repetitive segments into the training data. The following is an example: given the target sentence "there are lots of farmers doing this today", we first divide it into 3 segments "there are", "lots of farmers" and "doing this today". Then we copy the first two tokens of the second segment and append the special token DEL to the end, constructing the pseudo repetitive segment "lots of DEL". Finally, we insert the repetitive segment to the right of the chosen segment, resulting in 4 segments. Formally, given the expected segment number $K$ and the target sentence $y$, we first divide $y$ into $K-1$ segments $S^1, S^2, \cdots, S^{K-1}$, and then build a pseudo repetitive segment $S^i_{rep}$ by copying the first $m$ tokens of a randomly chosen segment $S^i$ and appending DEL to the end, where $m$ is uniformly sampled from $[1, |S^i|]$. Finally, $S^i_{rep}$ is inserted to the right of $S^i$, yielding the final $K$ segments $S^1, S^2, \cdots, S^i, S^i_{rep}, S^{i+1}, \cdots, S^{K-1}$. However, injecting such pseudo repetitive segments into all training instances would mislead the model into believing that generating and then deleting a repetitive segment is required behaviour, which is not desired. Therefore, we inject a pseudo repetitive segment into a training instance with probability $q$.
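The injection procedure can be sketched as follows (a simplified illustration; the function name is ours):

```python
import random

DEL = "<del>"  # special deletion token (written as DEL in the paper)

def inject_pseudo_repeat(segments, q):
    """With probability q, build a pseudo repetitive segment by copying the
    first m tokens of a randomly chosen segment and appending DEL, then
    insert it immediately to the right of that segment. In training, the
    target is first divided into K-1 segments so that the result has K."""
    if random.random() >= q:
        return segments
    i = random.randrange(len(segments))
    m = random.randint(1, len(segments[i]))
    pseudo = segments[i][:m] + [DEL]
    return segments[:i + 1] + [pseudo] + segments[i + 1:]
```

For the example above, choosing the second segment with m = 2 turns ["there are", "lots of farmers", "doing this today"] into four segments with "lots of DEL" inserted after "lots of farmers".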

Datasets
We conduct experiments on three widely-used machine translation datasets: IWSLT16 En-De (196k pairs), WMT14 En-De (4.5M pairs) and WMT16 En-Ro (610k pairs). For a fair comparison, we use the preprocessed datasets of Lee et al. (2018), in which sentences are tokenized and segmented into subwords using byte-pair encoding (BPE) (Sennrich et al., 2016) to restrict the vocabulary size. We use a shared vocabulary of 40k subwords for the source and target languages. For the WMT14 En-De dataset, we use newstest-2013 and newstest-2014 as the validation and test sets respectively. For the WMT16 En-Ro dataset, we employ newsdev-2016 and newstest-2016 as the validation and test sets respectively. For the IWSLT16 En-De dataset, we use test2013 as the validation set.

Experimental Settings
For model hyperparameters, we follow most of the settings of Gu et al. (2018); Lee et al. (2018); Wei et al. (2019). For the IWSLT16 En-De dataset, we use a small Transformer model (d_model = 278, d_hidden = 507, n_layer = 5, n_head = 2, p_dropout = 0.1). For the WMT14 En-De and WMT16 En-Ro datasets, we use a larger Transformer model (d_model = 512, d_hidden = 512, n_layer = 6, n_head = 8, p_dropout = 0.1). We linearly anneal the learning rate from 3×10^-4 to 10^-5 as in Lee et al. (2018) for the IWSLT16 En-De dataset, while employing the warm-up learning rate schedule (Vaswani et al., 2017) with t_warmup = 4000 for the WMT14 En-De and WMT16 En-Ro datasets. We also use label smoothing with ls = 0.15 for all datasets. We utilize sequence-level knowledge distillation (Kim and Rush, 2016), which replaces the target sentences in the training dataset with sentences generated by an autoregressive model, and set the beam size of the technique to 4. We initialize the encoder of RecoverSAT with the encoder of the corresponding autoregressive model, and share the parameters of the source and target token embedding layers and the pre-softmax linear layer. We measure the inference speedup for each task on a single NVIDIA P40 GPU with batch size 1.

Baselines
We use the Transformer (Vaswani et al., 2017) as our AT baseline and fifteen recent strong NAT models as NAT baselines, including: (1)

Overall Results
The performance of our RecoverSAT model and the baselines is shown in Table 2. Due to space limitations, we only show the results corresponding to the settings of the best BLEU scores for the baselines. From Table 2, we can observe that: (1) Our RecoverSAT model achieves comparable performance with the AT baseline (Transformer) while achieving a significant speedup. When K = 2, the BLEU score gap is moderate (from 0.06 to 0.4; our model is even better than Transformer on the WMT16 En→Ro and Ro→En tasks) and the speedup is about 2×. When K = 10, the BLEU scores drop by less than 5% relatively, and the speedup is considerably good (over 4×).
(2) Our RecoverSAT model outperforms all the strong NAT baselines except CMLM (on the WMT16 En→Ro and Ro→En tasks). However, the performance gap is negligible (0.16 and 0.12 BLEU respectively), and CMLM is a multi-step NAT method which is significantly slower than our model. (In Table 2, NPD denotes the noisy parallel decoding technique (Gu et al., 2018), LPD denotes the length parallel decoding technique (Wei et al., 2019), n denotes the sample size of NPD or LPD, and iter denotes the number of iterations of the iterative decoding method.)
(3) As K grows, the BLEU scores drop moderately while the speedup grows significantly, indicating that our RecoverSAT model generalizes well across segment numbers. For example, the BLEU scores drop by less than 0.45 when K grows from 2 to 5, and by no more than 0.90 (except on the WMT14 De→En task) when K further grows to 10. Meanwhile, the speedup for K = 10 is over 4×, which is considerably good.
(4) Only 7 baselines (SynST, imitate-NAT+LPD, LV NAR, NART+LPD, FCL-NAT+NPD, ReorderNAT and NART-DCRF+LPD) achieve a better speedup than our RecoverSAT model when K = 10, and among them only ReorderNAT and NART-DCRF+LPD achieve BLEU scores comparable with our model. The improvements of both ReorderNAT and NART-DCRF are complementary to our method, and combining them with our model is an interesting direction for future work.

Effect of Dynamic Termination Mechanism
As discussed in Section 3.2.1, the dynamic termination mechanism is used to train our RecoverSAT model to determine segment length dynamically conditioned on target-side context, so that the model can recover from multi-modality related errors.
In this section, we investigate the effect of this mechanism and the results are shown in Table 3.
As multi-modality related errors generally manifest as repetitive or missing tokens in the translation, we propose two quantitative metrics, "Rep" and "Mis", to measure the two phenomena respectively. "Rep" is defined as the relative increment of the repetitive token ratio w.r.t. a reference AT model, and "Mis" is defined as the relative increment of the missing token ratio (given the references) w.r.t. the same reference AT model. Formally, given the translations $\hat{Y} = \{\hat{y}^1 \cdots \hat{y}^k \cdots\}$ produced by the model to be evaluated and the translations $\hat{Y}_{auto} = \{\hat{y}^1_{auto} \cdots \hat{y}^k_{auto} \cdots\}$ produced by the reference AT model, "Rep" is defined as

$\text{Rep} = \frac{r(\hat{Y}) - r(\hat{Y}_{auto})}{r(\hat{Y}_{auto})}, \quad r(Y) = \frac{\sum_k \sum_{j>1} \mathbb{1}(y^k_j = y^k_{j-1})}{\sum_k |y^k|},$

where $\mathbb{1}(cond) = 1$ if the condition $cond$ holds and 0 otherwise, and $y^k_j$ is the $j$-th token of the translation sentence $y^k$.

[Table 3: The results are evaluated on the IWSLT16 En-De validation set. p is the parameter of the Bernoulli distribution in Eq. 5. "Rep" and "Mis" measure the relative increment (%) of repetitive and missing token ratios (see Section 4.5), the smaller the better. "Step" denotes the average number of decoding steps. "1→0" denotes annealing p from 1 to 0 linearly.]

Similarly, "Mis" is defined as

$\text{Mis} = \frac{m(\hat{Y}, Y) - m(\hat{Y}_{auto}, Y)}{m(\hat{Y}_{auto}, Y)},$

where $Y$ denotes the reference translations and $m(\cdot, \cdot)$ computes the missing token ratio, defined as

$m(\hat{Y}, Y) = \frac{\sum_k \sum_{w \in y^k} \max\big(c(y^k, w) - c(\hat{y}^k, w),\, 0\big)}{\sum_k |y^k|},$

where $c(y, w)$ is the number of occurrences of a token $w$ in the sentence $y$.
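The token-level statistics behind "Rep" and "Mis" can be sketched as follows (following the definitions above; the paper's exact normalization may differ slightly, so treat this as an illustrative implementation):

```python
from collections import Counter

def repetitive_ratio(sents):
    """Fraction of tokens that immediately repeat their predecessor."""
    repeats = sum(sum(1 for a, b in zip(s, s[1:]) if a == b) for s in sents)
    total = sum(len(s) for s in sents)
    return repeats / total

def missing_ratio(hyps, refs):
    """Fraction of reference tokens missing from the hypothesis, counted
    with multiplicity via the occurrence counts c(y, w)."""
    missing = 0
    for hyp, ref in zip(hyps, refs):
        hyp_counts, ref_counts = Counter(hyp), Counter(ref)
        missing += sum(max(c - hyp_counts[w], 0) for w, c in ref_counts.items())
    total = sum(len(r) for r in refs)
    return missing / total

def relative_increment(value, baseline):
    """Rep/Mis report each ratio relative to a reference AT model."""
    return (value - baseline) / baseline
```

For example, "there are lots of of farmers" has one adjacent repetition out of six tokens, and "there are lots farmers" misses one of five reference tokens against "there are lots of farmers".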
From Table 3, we can observe that: (1) By using the dynamic termination mechanism (p = 0.5, 1.0 or 1→0, where p is the parameter of the Bernoulli distribution in Eq. 5), both repetitive and missing token errors are reduced ("Rep" & "Mis") and the BLEU scores are increased, indicating the effectiveness of the mechanism. (2) As p grows larger, the average number of decoding steps ("Step") increases significantly; conversely, with smaller p more target sentences are divided into equal-length segments during training, which biases the model toward generating segments of similar lengths. However, if the model is never exposed to randomly divided segments (p = 0.0), it fails to learn to recover from multi-modality related errors and the BLEU score drops significantly. (3) By using the annealing dividing strategy (p = 1→0, see Section 3.2.1), we achieve a good balance between decoding speed and translation quality, and therefore use it as the default setting in this paper.

Effect of Segment Deletion Mechanism
In this section, we investigate the effect of the segment deletion mechanism; the results are shown in Table 4, where q is the probability of injecting a pseudo repetitive segment into each training instance. From the results we can observe that: (1) Without the segment deletion mechanism (q = 0), the BLEU score drops significantly and the repetitive token errors ("Rep") increase drastically, indicating that the mechanism is effective for recovering from repetitive token errors. (2) As q grows larger, the average number of decoding steps ("Step") increases steadily, because the model is misled into believing that generating and then deleting a repetitive segment is expected behaviour. Thus, q should not be too large.
(3) The repetitive token errors ("Rep") increase drastically when q > 0.7. We believe the reason is that the pseudo repetitive segments are constructed randomly, making it hard for the model to learn the underlying mapping. (4) The model achieves the best performance with q = 0.5. Therefore, we set q = 0.5 in our experiments.


Performance over Sentence Lengths


Figure 2 shows the translation quality of the Transformer, our RecoverSAT model with K = 10, and NAT on the IWSLT16 En-De validation set, bucketed by source sentence length. From the figure, we can observe that RecoverSAT surpasses NAT significantly and achieves comparable performance to the Transformer on all length buckets, which indicates the effectiveness of our model.

[Table 5: Translation examples of NAT and RecoverSAT. "Forced Translation" denotes the generated sentence when we manually force the model to generate a certain token (colored green) at a certain position. We use yellow to label repetitive tokens, red to label missing tokens, and gray to label segments to be deleted. We use " " to concatenate sub-words and subscript numbers (e.g., [1]) to mark the beginning of each segment.
Source: die er greif endste Abteilung ist das Denk mal für die Kinder , das zum Ged enken an die 1,5 Millionen Kinder , die in den Konzent rations lagern und Gas k ammern vernichtet wurden , erbaut wurde .
Reference: the most tragic section is the children's mem orial , built in memory of 1.5 million children killed in concentration camps and gas cham bers .
NAT Translation: the most tangible department department the monument monument the children , which was built commem commem orate 1.5 1.5 million children were destroyed in the concentration camps and gas cham bers .
RecoverSAT (K = 10):]

Case Study
We present translation examples of NAT and our RecoverSAT model on the WMT14 De→En validation set in Table 5. From the table, we can observe that: (1) The multi-modality problem (repetitive and missing tokens) is severe in the sentences generated by NAT, while it is effectively alleviated by RecoverSAT (see translations A to D); (2) RecoverSAT can leverage target-side contexts to dynamically determine segment lengths so as to reduce repetitive token errors (see translation B) or recover from missing token errors (see translations C and D); (3) RecoverSAT is capable of detecting and deleting repetitive segments, even when there are multiple such segments (see translation D).

Related Work
There has been a variety of work on accelerating the decoding process of sequence generation models (Kalchbrenner et al., 2018; Gu et al., 2018). In the field of neural machine translation, which is the focus of this work, Gu et al. (2018) first propose non-autoregressive machine translation (NAT), which generates all target tokens simultaneously. Although it accelerates the decoding process significantly, NAT suffers from the multi-modality problem (Gu et al., 2018), which generally manifests as repetitive or missing tokens in the translation. Therefore, intensive efforts have been devoted to alleviating the multi-modality problem in NAT. Wang et al. (2018) further propose a semi-autoregressive Transformer method, which generates segments autoregressively and predicts the tokens in a segment non-autoregressively. However, none of the above methods explicitly consider recovering from multi-modality related errors.
Recently, multi-step NAT models have also been investigated to address this issue. Lee et al. (2018) and Ghazvininejad et al. (2019) adopt iterative decoding methods which have the potential to recover from generation errors. Besides, Stern et al. and Gu et al. (2019) propose to use dynamic insertion/deletion to alleviate repetitive and missing tokens. Different from these works, our model turns one-step NAT into a semi-autoregressive form, which maintains a considerable speedup while enabling the model to see the local history and future so as to avoid repetitive or missing words during decoding. Our model can also replace the one-step NAT components of these methods to further improve their performance.

Conclusion
In this work, we propose a novel semi-autoregressive model, RecoverSAT, to alleviate the multi-modality problem. It performs translation by generating segments non-autoregressively while predicting the tokens within a segment autoregressively. By determining segment lengths dynamically, RecoverSAT is capable of recovering from missing token errors and reducing repetitive token errors. By explicitly detecting and deleting repetitive segments, RecoverSAT is able to recover from repetitive token errors. Experiments on three widely-used benchmark datasets show that our RecoverSAT model maintains comparable translation quality with more than 4× decoding speedup compared with the AT model.