Improving Autoregressive NMT with Non-Autoregressive Model

Autoregressive neural machine translation (NMT) models are often used to teach non-autoregressive models via knowledge distillation. However, there are few studies on improving the quality of autoregressive translation (AT) using non-autoregressive translation (NAT). In this work, we propose a novel Encoder-NAD-AD framework for NMT, aiming at boosting AT with global information produced by a NAT model. Specifically, under the semantic guidance of the source-side context captured by the encoder, the non-autoregressive decoder (NAD) first learns to generate the target-side hidden state sequence in parallel. Then the autoregressive decoder (AD) performs translation from left to right, conditioned on both the source-side and target-side hidden states. Since the AD has access to global information generated by the low-latency NAD, it is more likely to produce a better translation with little extra time delay. Experiments on the WMT14 En⇒De, WMT16 En⇒Ro, and IWSLT14 De⇒En translation tasks demonstrate that our framework achieves significant improvements with only 8% speed degradation over the autoregressive NMT.


Introduction
Neural machine translation (NMT) based on the encoder-decoder framework has made rapid progress in recent years (Sutskever et al., 2014; Bahdanau et al., 2015; Wu et al., 2016; Gehring et al., 2017; Vaswani et al., 2017). All these high-performance NMT models generate target sentences from left to right in an autoregressive manner. An obvious limitation of autoregressive translation (AT) is that the inference process can hardly be parallelized, and the inference time is linear with respect to the length of the target sequence.
To speed up the inference of machine translation, non-autoregressive translation (NAT) models have been proposed, which generate all target tokens independently and simultaneously (Gu et al., 2017; Lee et al., 2018; Kaiser et al., 2018; Libovický and Helcl, 2018). Although NAT is successfully trained with the help of an AT model as its teacher via knowledge distillation (Kim and Rush, 2016), there is no work focusing on improving the quality of AT using NAT. A natural question therefore arises: can we boost AT with NAT?
In this paper, we propose a novel and effective Encoder-NAD-AD framework for NMT, in which the newly added non-autoregressive decoder (NAD) provides target-side global information while the autoregressive decoder (AD) translates, as illustrated in Figure 1. Briefly speaking, the encoder first encodes the source sequence into a sequence of vector representations. The NAD then reads the encoder representations and generates a coarse target sequence in parallel. Given the source-side and target-side contexts separately captured by the encoder and the NAD, the AD learns to generate the final translation token by token.
Our proposed model combines two major advantages over previous work (Vaswani et al., 2017; Xia et al., 2017). On the one hand, due to the low inference latency of NAT, the decoding efficiency of our proposed framework is only slightly lower than that of standard NMT models, as shown in Figure 1. On the other hand, since the AD can access the global target-side context provided by the NAD, it has the potential to generate a better translation by fully exploiting both source-side and target-side contexts. We conduct extensive experiments on the WMT14 En⇒De, WMT16 En⇒Ro, and IWSLT14 De⇒En translation tasks. Experimental results demonstrate that our proposed model achieves substantial improvements with only an 8% degradation in decoding efficiency compared to the standard NMT.

The Framework
Our goal in this work is to improve autoregressive NMT using a non-autoregressive model with lower inference latency. Figure 2 shows the model architecture of the proposed framework. Next, we detail the individual components and introduce the algorithm for training and inference.
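Before detailing each component, the following Python sketch summarizes the overall decoding pipeline; the function and component names are illustrative placeholders, not the authors' actual interfaces.

```python
def translate(src_tokens, encoder, nad, ad):
    """High-level Encoder-NAD-AD pipeline: encode once, draft in
    parallel with the NAD, then decode left-to-right with the AD."""
    h = encoder(src_tokens)   # source-side hidden states
    s = nad(h)                # target-side draft states, generated in parallel
    y = ad(h, s)              # final translation, produced token by token
    return y
```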

The Neural Encoder
The neural encoder of our model is identical to that of the dominant Transformer model, which is modeled using the self-attention network. The encoder is composed of a stack of N identical layers, each of which has two sub-layers:

$$\tilde{h}^l = \mathrm{LN}\big(h^{l-1} + \mathrm{MHAtt}(h^{l-1}, h^{l-1}, h^{l-1})\big),$$
$$h^l = \mathrm{LN}\big(\tilde{h}^l + \mathrm{FFN}(\tilde{h}^l)\big),$$

where the superscript l indicates layer depth, $h^l$ denotes the source hidden state of the l-th layer, LN is layer normalization, FFN means feed-forward networks, and MHAtt denotes the multi-head attention mechanism (Vaswani et al., 2017).
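To make the layer structure concrete, here is a minimal PyTorch sketch of one encoder layer under the equations above; the class name, hyper-parameter defaults, and post-layer-norm placement are illustrative assumptions, not the authors' released code.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer: self-attention and a feed-forward
    network, each wrapped in a residual connection plus layer norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, h):
        # h: (batch, src_len, d_model); padding masks omitted for brevity
        attn_out, _ = self.self_attn(h, h, h)
        h = self.ln1(h + attn_out)        # h~^l = LN(h^{l-1} + MHAtt(...))
        return self.ln2(h + self.ffn(h))  # h^l  = LN(h~^l + FFN(h~^l))
```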

Non-Autoregressive Decoder
We initialize the non-autoregressive decoder inputs by copying source inputs from the encoder side with the fertility mechanism (Gu et al., 2017). For each layer in the non-autoregressive decoder, the lowest sub-layer is the unmasked multi-head self-attention network, and it also uses residual connections around each of the sub-layers, followed by layer normalization.
The second sub-layer is a positional attention. We follow Gu et al. (2017) and use the positional encoding p as both query and key and the decoder states as the value. Letting $s^l$ denote the NAD hidden state of the l-th layer and $\tilde{s}^l$ the output of its self-attention sub-layer, we have

$$\hat{s}^l = \mathrm{LN}\big(\tilde{s}^l + \mathrm{MHAtt}(p, p, \tilde{s}^l)\big).$$

The third sub-layer is Enc-NAD cross-attention that integrates the representation of the corresponding source sentence, and the fourth sub-layer is an FFN:

$$\bar{s}^l = \mathrm{LN}\big(\hat{s}^l + \mathrm{MHAtt}(\hat{s}^l, h^N, h^N)\big),$$
$$s^l = \mathrm{LN}\big(\bar{s}^l + \mathrm{FFN}(\bar{s}^l)\big),$$

where $h^N$ is the source hidden state of the top layer.
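A minimal PyTorch sketch of one NAD layer follows; it mirrors the four sub-layers above (unmasked self-attention, positional attention, Enc-NAD cross-attention, FFN), but the class and argument names are our own illustrative choices, not the paper's implementation.

```python
import torch.nn as nn

class NADLayer(nn.Module):
    """One non-autoregressive decoder layer with four sub-layers."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.pos_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.lns = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, s, p, h_enc):
        # s: NAD states (batch, tgt_len, d); p: positional encodings (same shape);
        # h_enc: top-layer encoder states h^N (batch, src_len, d)
        out, _ = self.self_attn(s, s, s)            # unmasked: sees all positions
        s = self.lns[0](s + out)
        out, _ = self.pos_attn(p, p, s)             # query/key = p, value = states
        s = self.lns[1](s + out)
        out, _ = self.cross_attn(s, h_enc, h_enc)   # Enc-NAD cross-attention
        s = self.lns[2](s + out)
        return self.lns[3](s + self.ffn(s))         # FFN sub-layer
```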

Autoregressive Decoder
For each layer in the autoregressive decoder, the lowest sub-layer is the masked multi-head self-attention network:

$$\tilde{t}^l = \mathrm{LN}\big(t^{l-1} + \mathrm{MHAtt}(t^{l-1}, t^{l-1}, t^{l-1})\big),$$

where $t^l$ denotes the AD hidden state of the l-th layer and the attention is masked so that each position attends only to earlier positions. The second sub-layer is NAD-AD cross-attention that integrates the non-autoregressive sequence context into the autoregressive decoder:

$$\hat{t}^l = \mathrm{LN}\big(\tilde{t}^l + \mathrm{MHAtt}(\tilde{t}^l, s^M, s^M)\big),$$

where $s^M$ is the NAD hidden state of the top layer. In addition, the decoder stacks both Enc-AD cross-attention and FFN sub-layers to seek task-relevant input semantics and bridge the gap between the input and output languages:

$$\bar{t}^l = \mathrm{LN}\big(\hat{t}^l + \mathrm{MHAtt}(\hat{t}^l, h^N, h^N)\big),$$
$$t^l = \mathrm{LN}\big(\bar{t}^l + \mathrm{FFN}(\bar{t}^l)\big).$$

Training and Inference

Given a parallel training set D, the whole model is optimized by jointly maximizing the likelihood of the AD output (conditioned on both the encoder and NAD hidden states) and the NAD output over the training data:

$$L(\theta) = \sum_{(x, y) \in D} \Big( \log P(y \mid x; \theta) + \lambda \log P(y^{nad} \mid x; \theta) \Big),$$

where $y^{nad}$ is the reference of NAT, which can be obtained from a standard NMT model via sequence-level knowledge distillation (Gu et al., 2017; Lee et al., 2018; Wang et al., 2019), and λ is a hyper-parameter used to balance the preference between the two terms. Once our model is trained, we use the decoding algorithm shown in Figure 1 to translate the source language with little extra time over the autoregressive NMT.
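As a sketch of the joint objective, assuming the two cross-entropy terms are simply combined with the weight λ (equivalent to maximizing the log-likelihood above), one possible implementation is the following; `lam` and `pad_id` are illustrative placeholders.

```python
import torch.nn.functional as F

def joint_loss(ad_logits, nad_logits, y, y_nad, lam=0.5, pad_id=0):
    """AD cross-entropy on the gold target y plus a lambda-weighted
    NAD cross-entropy on the distilled target y_nad."""
    # logits: (batch, len, vocab) -> (batch, vocab, len) for F.cross_entropy
    loss_ad = F.cross_entropy(ad_logits.transpose(1, 2), y,
                              ignore_index=pad_id)
    loss_nad = F.cross_entropy(nad_logits.transpose(1, 2), y_nad,
                               ignore_index=pad_id)
    return loss_ad + lam * loss_nad
```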

Experiments
We use 4-gram NIST BLEU (Papineni et al., 2002) as the evaluation metric, and the sign-test (Collins et al., 2005) for statistical significance testing.

Model Settings
We build the described models on top of the open-sourced tensor2tensor toolkit. For our proposed model, we employ the Adam optimizer with β1=0.9, β2=0.998, and ε=10^{-9}. For En⇒De and En⇒Ro, we use the hyper-parameter settings of the base Transformer model of Vaswani et al. (2017), whose encoder and decoder both have 6 layers, 8 attention heads, and a hidden size of 512. We follow Gu et al. (2017) and use the same small Transformer setting for IWSLT14 because of its smaller dataset. For evaluation, we use argmax decoding for the NAD, and beam search with a beam size of k=4 and length penalty α=0.6 for the AD. We also re-implement and compare with the deliberation network (Xia et al., 2017) based on the strong Transformer, which adopts the two-pass decoding method and uses the autoregressive decoding manner for the first decoder.
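For concreteness, a sketch of the stated optimizer settings in PyTorch is shown below; the learning-rate schedule is our assumption following the inverse-square-root warmup of Vaswani et al. (2017), and `model` stands in for the full Encoder-NAD-AD network.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)  # placeholder for the full Encoder-NAD-AD model

# Adam settings stated in the paper: beta1=0.9, beta2=0.998, eps=1e-9.
optimizer = torch.optim.Adam(model.parameters(), betas=(0.9, 0.998), eps=1e-9)

def transformer_lr(step, d_model=512, warmup=4000):
    """Assumed inverse-square-root warmup schedule; warmup is a placeholder."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```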

Results and Analysis
In this section, we evaluate and analyze the proposed approach on En⇒De, En⇒Ro, and De⇒En translation tasks.

Model Complexity
We first compare the model parameters and training speed on De⇒En for the Transformer baseline, the deliberation network, and our proposed model, which have 10.3M, 16.3M, and 18.0M parameters, respectively. Although our model uses more parameters than the deliberation network due to the additional positional attention network, its training speed is significantly faster (1.8 steps/s vs. 0.7 steps/s).

Translation Quality We report the translation performance in Table 1, from which we can draw the following conclusions: (1) Our proposed model (row 10) significantly outperforms the Transformer baseline (row 8) by 0.59, 0.89, and 1.14 BLEU points on the three translation tasks, respectively. (2) Compared to the existing deliberation network, which uses greedy search for the one-pass decoding, our model obtains comparable performance. (3) Our NAT model (row 9) achieves competitive or even better accuracy than previous NAT models (rows 1-3).
Decoding Speed Table 2 shows the decoding efficiency of different models. The deliberation network achieves its translation improvement at the cost of a substantial drop in decoding speed (68% degradation). However, due to the high inference efficiency of non-autoregressive models (16× speedup over the Transformer), the decoding efficiency of our proposed framework is only slightly lower (8% degradation) than that of the standard autoregressive Transformer.
Case Study To better understand how our model works, we present a translation example sampled from the De⇒En task in Table 3. The standard AT model incorrectly translates the phrase "geschrieben sein könnte" into "may be", and omits the word "geschrieben". This problem is well addressed by the Encoder-NAD-AD framework, since the AD can access the global information contained in the draft sequence generated by the NAD, and therefore outputs a better sentence.

Table 3: A De⇒En translation example (underlining in the original marks the phrase of interest).

Source:    ich sage dann mit meinen eigenen worten , was zwischen diesem gerüst geschrieben sein könnte .
Reference: then i will say , in my own words , what could be written within this framework .
AT:        i then say to my own words , which may be between that framework .
NAT:       i i say with my own words , which could be written between this scaffold .
Our Model: i then say , in my own words , what could be written between this framework ?

Related Work
There are many design choices in the encoder-decoder framework based on different types of layers, such as RNN-based (Sutskever et al., 2014), CNN-based (Gehring et al., 2017), and self-attention-based (Vaswani et al., 2017) approaches. Particularly, relying entirely on the attention mechanism, the Transformer introduced by Vaswani et al. (2017) can improve the training speed as well as model performance.
In terms of speeding up the decoding of the neural Transformer, Gu et al. (2017) modified the autoregressive architecture to directly generate target words in parallel. In the past two years, non-autoregressive and semi-autoregressive models have been extensively studied (Oord et al., 2017; Kaiser et al., 2018; Lee et al., 2018; Libovický and Helcl, 2018; Wang et al., 2019; Guo et al., 2018; Zhou et al., 2019a). Previous work shows that NAT can be improved via knowledge distillation from AT models. In contrast, the idea of improving AT with NAT is not well explored.
The work most relevant to our proposed framework is the deliberation network (Xia et al., 2017), which leverages global information by observing both backward and forward information in sequence decoding through a deliberation process. Recently, Zhang et al. (2018) proposed asynchronous bidirectional decoding for NMT (ABD-NMT), which extends the conventional encoder-decoder framework by introducing a backward decoder. Different from ABD-NMT, synchronous bidirectional sequence generation models perform left-to-right and right-to-left decoding simultaneously and interactively (Zhou et al., 2019b). Besides, Geng et al. (2018) introduced an adaptive multi-pass decoder into standard NMT models. However, all the above models improve translation quality while greatly reducing inference efficiency.

Conclusion
In this work, we propose a novel Encoder-NAD-AD framework for NMT, aiming at improving the quality of the autoregressive decoder with global information produced by the newly added non-autoregressive decoder. We extensively evaluate the proposed model on three machine translation tasks (En⇒De, En⇒Ro, and De⇒En). Compared to the existing deliberation network (Xia et al., 2017), which suffers from serious decoding speed degradation, our proposed model achieves a significant improvement in translation quality with little loss of decoding efficiency compared to the state-of-the-art autoregressive NMT.