Imitation Learning for Non-Autoregressive Neural Machine Translation

Non-autoregressive translation models (NAT) have achieved impressive inference speedup. A potential issue of the existing NAT algorithms, however, is that the decoding is conducted in parallel, without directly considering previous context. In this paper, we propose an imitation learning framework for non-autoregressive machine translation, which retains the fast translation speed while achieving translation performance comparable to its autoregressive counterpart. We conduct experiments on the IWSLT16, WMT14 and WMT16 datasets. Our proposed model achieves a significant speedup over the autoregressive models while keeping the translation quality comparable. By sampling sentence length in parallel at inference time, we achieve 31.85 BLEU on WMT16 Ro→En and 30.68 BLEU on IWSLT16 En→De.


Introduction
Neural machine translation (NMT) with encoder-decoder architectures (Sutskever et al., 2014; Cho et al., 2014) achieves significantly improved performance compared with traditional statistical methods (Koehn et al., 2003; Koehn, 2010). Nevertheless, the autoregressive property of the NMT decoder has been a bottleneck for translation speed. Specifically, the decoder, whether based on a Recurrent Neural Network (RNN) (Hochreiter and Schmidhuber, 1997; Cho et al., 2014) or an attention mechanism (Vaswani et al., 2017), generates words sequentially, with each word conditioned on the previous words in the sentence. This bottleneck prevents parallel computation in the decoder, which is a serious problem for NMT because decoding with a large vocabulary is extremely time-consuming.
Recently, a line of research (Gu et al., 2017; Lee et al., 2018; Libovický and Helcl, 2018; Wang et al., 2018) has proposed to break the autoregressive bottleneck by introducing non-autoregressive neural machine translation (NAT). In NAT, the decoder generates all words simultaneously instead of sequentially. Intuitively, NAT abandons feeding previously predicted words into the decoder state at the next time step, and instead directly copies the encoded source representation (Gu et al., 2017; Lee et al., 2018; Guo et al., 2018; Wang et al., 2019) as the decoder input. Thus, generation in NAT models is not conditioned on previous predictions. NAT enables parallel computation in the decoder, giving significantly faster translation speed with moderate accuracy (always within 5 BLEU). Figure 1 shows the difference between autoregressive and non-autoregressive models.
* This work was done when the first author was on an internship at Tencent.
However, we argue that current NAT approaches suffer from delayed supervision (or rewards) and a large search space in training. The NAT decoder generates all words of the translation simultaneously, so its search space is very large. For one time step, the decoding states across layers (more than 16 layers) and time steps can be regarded as a 2-dimensional sequential decision process. Every decoding state has to decide not only which part of the target sentence it will focus on, but also the correct target word for that part. All decisions are made through interactions with other decoding states. The delayed supervision (the correct target word) is received only by decoding states in the last layer, and intermediate decoding states are updated by gradients propagated from the last layer. Therefore, training NAT is non-trivial and it may be hard to obtain a good model, much as reinforcement learning (Mnih et al., 2013, 2015) is hard with a large search space. The delayed supervision problem is not severe for autoregressive neural machine translation (AT) because it predicts words sequentially. Given the previous words, the content to be predicted at the current step is relatively well determined, so the search space of AT is exponentially smaller than that of NAT. We attribute the performance gap between NAT and AT to the delayed supervision and the large search space.
In this paper, we propose a novel imitation learning framework for non-autoregressive NMT (imitate-NAT). Imitation learning has been widely used to alleviate the problem of a huge search space with delayed supervision in RL, so it is natural to bring the idea to NAT to boost its performance. Specifically, we introduce a knowledgeable AT demonstrator to supervise every decoding state of the NAT model across different time steps and layers, which works well in practice. Since the AT demonstrator is only used in training, our proposed imitate-NAT enjoys the high speed of NAT without suffering from its relatively lower translation performance.
Experiments show that our proposed imitate-NAT is fast and accurate: it effectively closes the performance gap between AT and NAT on several standard benchmarks while maintaining the speed advantage of NAT (10 times faster). On all the benchmark datasets, our imitate-NAT with length parallel decoding (LPD) achieves the best translation performance, which is even close to the results of the autoregressive model.

Background
In the following sections, we introduce the background about Autoregressive Neural Machine Translation and Non-Autoregressive Neural Machine Translation.

Autoregressive Neural Machine Translation
Sequence modeling in machine translation has largely focused on autoregressive modeling, which generates a target sentence word by word from left to right, denoted by $p_\theta(Y \mid X)$, where $X = \{x_1, \cdots, x_{T'}\}$ and $Y = \{y_1, \cdots, y_T\}$ represent the source and target sentences as sequences of words, respectively. $\theta$ is a set of parameters usually trained to minimize the negative log-likelihood:
$$\mathcal{L}_{AT} = -\sum_{t=1}^{T} \log p(y_t \mid y_{<t}, X; \theta), \tag{1}$$
where $T'$ and $T$ are the lengths of the source and the target sequence, respectively.
Deep neural networks with an autoregressive framework have achieved great success in machine translation, with different choices of architecture. The RNN-based NMT approach, or RNMT, was quickly established as the de facto standard for NMT. Despite this success, the inherently sequential architecture prevents RNMT from being parallelized during training and inference. Following RNMT, CNN-based and self-attention-based models have recently drawn research attention due to their ability to fully parallelize training and take advantage of modern fast computing devices. However, the autoregressive nature still creates a bottleneck at the inference stage since, without the ground truth, the prediction of each target token has to be conditioned on previously predicted tokens.
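The inference bottleneck can be illustrated with a minimal greedy decoding loop. This is only a sketch: `step_fn`, `toy_step` and the token ids are hypothetical stand-ins for a trained decoder, not part of any model described here. The point is that each iteration needs the prefix produced by all earlier iterations, so the loop cannot be parallelized over target positions.

```python
def greedy_decode(step_fn, bos, eos, max_len):
    """Autoregressive greedy decoding: step t needs the tokens from steps < t."""
    prefix = [bos]
    for _ in range(max_len):
        scores = step_fn(prefix)  # hypothetical: maps a prefix to per-token scores
        next_id = max(range(len(scores)), key=scores.__getitem__)
        prefix.append(next_id)
        if next_id == eos:
            break
    return prefix[1:]


def toy_step(prefix):
    """Toy scorer (an assumption): prefers the token id (last + 1), capped at 3."""
    want = min(prefix[-1] + 1, 3)
    return [1.0 if i == want else 0.0 for i in range(4)]


out = greedy_decode(toy_step, bos=0, eos=3, max_len=10)
```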

Non-Autoregressive Neural Machine Translation
As a solution to the issue of slow decoding, Gu et al. (2017) recently proposed the non-autoregressive model (NAT), which breaks the inference bottleneck by exposing all decoder inputs to the network simultaneously. NAT removes the autoregressive connection directly and factorizes the target distribution into a product of conditionally independent per-step distributions. The negative log-likelihood loss function for the NAT model is then defined as:
$$\mathcal{L}_{NAT} = -\sum_{t=1}^{T} \log p(y_t \mid X; \theta). \tag{2}$$
The approach breaks the dependency among the target words across time, so the target distributions can be computed in parallel at inference time.
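To make the factorization concrete, the sketch below computes the NAT negative log-likelihood from per-position distributions that were each produced without seeing any other position's output. The function name `nat_nll` and the toy probabilities are illustrative assumptions.

```python
import math


def nat_nll(position_probs, target_ids):
    """Negative log-likelihood under conditionally independent positions.

    position_probs[t][w] stands for p(y_t = w | X); the sum over t mirrors
    the product of per-step distributions in the NAT factorization.
    """
    return -sum(math.log(position_probs[t][w]) for t, w in enumerate(target_ids))


# Toy example: 3 target positions, vocabulary of 2 words (values are made up).
probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
loss = nat_nll(probs, [0, 1, 1])
```

Because each term depends only on the source, all positions can be scored (and predicted) at the same time.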
In particular, the encoder stays unchanged from the original Transformer network. A latent fertility model is then used to copy the sequence of source embeddings as the input of the decoder. The decoder has the same architecture as the encoder plus the encoder attention. The best results were achieved by sampling fertilities from the model and then rescoring the output sentences using an autoregressive model. The reported inference speed of this method is 2-15 times faster than a comparable autoregressive model, depending on the number of fertility samples.
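As a rough illustration of fertility-based copying, the sketch below repeats each source item as many times as its fertility to form the decoder input. The function name and the toy fertilities are assumptions; in Gu et al. (2017) the fertilities are predicted by the model rather than given by hand.

```python
def copy_by_fertility(source, fertilities):
    """Build decoder inputs by copying source item i exactly fertilities[i] times."""
    decoder_inputs = []
    for item, fertility in zip(source, fertilities):
        decoder_inputs.extend([item] * fertility)
    return decoder_inputs


# Toy example: the second source item is dropped, the third is copied twice.
inputs = copy_by_fertility(["a", "b", "c"], [1, 0, 2])
```

The sum of the fertilities fixes the target length, which is why sampling several fertility sequences yields several candidate translations.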
This desirable property of exact and parallel decoding however comes at the expense of potential performance degradation. Since the conditional dependencies within the target sentence (y t depends on y <t ) are removed from the decoder input, the decoder is not powerful enough to leverage the inherent sentence structure for prediction. Hence the decoder has to figure out such target-side information by itself just with the source-side information during training, which leads to a larger modeling gap between the true model and the neural sequence model. Therefore, strong supervised signals could be introduced as the latent variable to help the model learn better internal dependencies within a sentence.
In AT models, the generation of the current token is conditioned on previously generated tokens, which provides strong target-side context information. In contrast, NAT models generate tokens in parallel, so the target-side dependency is indirect and weak. Consequently, the decoder of a NAT model has to handle the translation task conditioned on less and weaker information than its AT counterpart, leading to inferior accuracy.

Proposed Method: imitate-NAT
In this section, we propose an imitation learning framework (imitate-NAT) to close the performance gap between NAT and AT.

Preliminary of imitate-NAT
We bring the intuition of imitation learning to non-autoregressive NMT and adapt it to our scenario. Specifically, the NAT model can be regarded as a learner, which imitates a knowledgeable demonstrator at each decoding state across layers and time steps. However, obtaining an adequate demonstrator is non-trivial. We propose to employ an autoregressive NMT model as the demonstrator, which is expected to offer efficient supervision to each decoding state of the NAT model. Fortunately, the AT demonstrator is only used in training, which guarantees that our proposed imitate-NAT enjoys the high speed of the NAT model without suffering from its relatively lower performance.
In the following parts, we describe the AT demonstrator and the NAT learner in our imitate-NAT framework, respectively.

AT Demonstrator
For the proposed AT, we apply a variant of the transformer model as the demonstrator, named DAT. The encoder stays unchanged from the original Transformer network. A crucial difference lies in that the decoder introduces the imitation module which emits actions at every time step. The action brings sequential information, thus can be used as the guidance signal during the NAT training process.
The input of each decoder layer $O^\ell = \{o^\ell_1, o^\ell_2, \cdots, o^\ell_T\}$ can be considered as the observation (or environment) of the IL framework, where $\ell$ denotes the layer of the observation. Let $A^\ell = \{a^\ell_1, a^\ell_2, \cdots, a^\ell_T\} \in \mathcal{A}$ denote an action sequence from the action space $\mathcal{A}$. The action space $\mathcal{A}$ is finite and its size $n$ is a hyperparameter, representing the number of action categories. The action distribution of DAT can then be fed to the NAT model as the training signal. Let $\Pi$ denote a policy class, where each $\pi^\ell \in \Pi$ generates an action distribution sequence $A^\ell$ in response to a context sequence $O^\ell$.
Predicting the actions $A^\ell$ may depend on the contexts of the previous layer $O^\ell$, and policies $\pi^\ell$ can thus be viewed as mapping states to actions. A roll-out of $\pi^\ell$ given the context sequence $O^\ell$ determines the action sequence $A^\ell$:
$$a^\ell_t = \arg\max\big(\pi^\ell(o^\ell_t)\big).$$
The distribution $\pi^\ell(o^\ell_t)$ represents the probability of each decision given the current state or environment $o^\ell_t$. The discrete operation $\arg\max(\cdot)$ is non-differentiable, which makes it impossible to train the policy end to end. Therefore, unlike the general reinforcement or imitation learning framework, we compute the action state as the expectation of the embedding of the action $a^\ell_t$:
$$u^\ell_t = \mathbb{E}_{a^\ell_t \sim \pi^\ell(o^\ell_t)}\, \delta(a^\ell_t),$$
where $\delta(a^\ell_t) \in \mathbb{R}^k$ returns the embedding of the action $a^\ell_t$ and $k$ denotes the embedding dimension. The states of the next layer are then based on the current output of the decoder state and the emitted action state:
$$o^{\ell+1}_t = \mathrm{Transfer}(u^\ell_t + o^\ell_t),$$
where $\mathrm{Transfer}(\cdot)$ denotes the vanilla Transformer decoding function, consisting of a self-attention layer, an encoder-decoder attention layer and an FFN layer (Vaswani et al., 2017).
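The difference between the hard arg max choice and the expected action embedding can be sketched as follows. This is a minimal illustration under stated assumptions: the policy is taken to be a softmax over some logits, and the embedding table and all values are toy placeholders.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def expected_action_embedding(policy, embeddings):
    """Probability-weighted average of the action embeddings (the soft action state)."""
    k = len(embeddings[0])
    return [sum(p * emb[d] for p, emb in zip(policy, embeddings)) for d in range(k)]


# n = 3 action categories, k = 2 embedding dimensions (toy values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
policy = softmax([2.0, 0.0, 0.0])

# Hard choice: picks one embedding, so no gradient flows back to the policy.
hard = embeddings[max(range(len(policy)), key=policy.__getitem__)]
# Soft choice: a smooth function of the policy probabilities.
soft = expected_action_embedding(policy, embeddings)
```

The soft state changes continuously with the policy, which is what allows end-to-end training.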

Action Distribution Regularization
The supervised signal for the action distribution $\pi(o_t)$ is not direct in NAT, so action prediction can be viewed as an unsupervised clustering problem. One potential issue is an unbalanced distribution of actions. Inspired by Xie et al. (2016), we introduce a regularization method to increase the space utilization. Formally, a moving average $c$ is applied to calculate the cumulative activation level for each action category:
$$c \leftarrow \alpha \cdot c + (1 - \alpha)\, \frac{1}{T} \sum_{t=1}^{T} \pi(o_t).$$
We set $\alpha = 0.9$ in our experiments. Then $\pi(o_t)$ can be re-normalized with the cumulative history $c$:
$$\pi'_j(o_t) = \frac{\pi_j(o_t)^2 / c_j}{\sum_{j'} \pi_{j'}(o_t)^2 / c_{j'}},$$
where $j$ indexes the action categories. The convexity of the quadratic function sharpens the distribution, which serves the purpose of clustering, while $c$ redistributes the probability mass of $\pi(o_t)$, leading to a more balanced category assignment. We define our objective as a KL divergence loss between $\pi(o_t)$ and the auxiliary distribution $\pi'(o_t)$ as follows:
$$\mathcal{L}_\pi = \sum_{t=1}^{T} \sum_{j} \pi'_j(o_t) \log \frac{\pi'_j(o_t)}{\pi_j(o_t)}.$$

NAT Learner
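The regularization steps can be sketched in a few lines. The sketch assumes a re-normalization in the style of Xie et al. (2016), where each probability is squared and divided by the running usage of its category, and a KL divergence taken from the auxiliary distribution to the policy. The function names, the averaging over positions and the toy values are assumptions.

```python
import math


def update_moving_average(c, policies, alpha=0.9):
    """Exponential moving average of the mean activation per action category."""
    n = len(c)
    mean = [sum(p[j] for p in policies) / len(policies) for j in range(n)]
    return [alpha * c[j] + (1 - alpha) * mean[j] for j in range(n)]


def renormalize(policy, c):
    """Square each probability, divide by the category's running usage, normalize."""
    raw = [p * p / cj for p, cj in zip(policy, c)]
    z = sum(raw)
    return [r / z for r in raw]


def kl(target, policy):
    """KL(target || policy) for two discrete distributions."""
    return sum(q * math.log(q / p) for q, p in zip(target, policy) if q > 0)


c = [0.5, 0.5]                       # uniform starting history
policies = [[0.7, 0.3], [0.9, 0.1]]  # policy outputs at two decoding states
c = update_moving_average(c, policies)
aux = [renormalize(p, c) for p in policies]
loss = sum(kl(q, p) for q, p in zip(aux, policies))
```

Dividing by `c` lowers the target probability of categories that have been used often, which is what pushes the assignment toward balance.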

Soft Copy
To facilitate the imitation learning process, our imitate-NAT is based on the AT demonstrator described in section 3.2. The only difference lies in the initialization of the decoding inputs. Previous approaches apply a UniformCopy method to address the problem. More specifically, the decoder input at position $t$ is a copy of the encoder embedding at position $\mathrm{Round}(T' t / T)$ (Gu et al., 2017; Lee et al., 2018), where $T'$ and $T$ are the source and target lengths. As the source and target sentences are often of different lengths, the NAT model needs to predict the target length $T$ during the inference stage. The length prediction problem can be viewed as a typical classification problem based on the output of the encoder, and we follow Lee et al. (2018) to predict the length of the target sequence. The Round function is unstable and non-differentiable, which makes the decoding task difficult. We therefore propose a differentiable and robust method named SoftCopy, following the spirit of the attention mechanism (Hahn and Keller, 2016; Bengio, 2009). The weight $w_{i,j}$ depends on the distance between the source position $i$ and the target position $j$:
$$w_{i,j} = \frac{\exp\left(-|j - i| / \tau\right)}{\sum_{i'=1}^{T'} \exp\left(-|j - i'| / \tau\right)},$$
where $\tau$ is a trainable parameter used to adjust the degree of focus when copying. The decoder input $d_j$ at target position $j$ can then be computed as:
$$d_j = \sum_{i=1}^{T'} w_{i,j}\, x_i,$$
where $x_i$ is usually the source embedding at position $i$. It is also worth mentioning that we take the top-most hidden states instead of the word embeddings as $x_i$ in order to capture the global context information.
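The contrast between the two copy mechanisms can be sketched as follows. This is a minimal illustration, assuming 0-based positions, a fixed `tau` instead of a trained one, and a softmax over source positions of the negative distance scaled by `tau`; the helper names are hypothetical.

```python
import math


def uniform_copy_index(t, src_len, tgt_len):
    """Hard copy: source index for target position t (round half up, clamped)."""
    return min(src_len - 1, math.floor(src_len * t / tgt_len + 0.5))


def soft_copy_weights(j, src_len, tau):
    """Softmax over source positions of -|j - i| / tau."""
    scores = [-abs(j - i) / tau for i in range(src_len)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def soft_copy(source_states, tgt_len, tau=1.0):
    """Each decoder input is a distance-weighted average of all source states."""
    src_len, k = len(source_states), len(source_states[0])
    inputs = []
    for j in range(tgt_len):
        w = soft_copy_weights(j, src_len, tau)
        inputs.append([sum(w[i] * source_states[i][d] for i in range(src_len))
                       for d in range(k)])
    return inputs


states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoder outputs
hard_indices = [uniform_copy_index(t, 3, 4) for t in range(4)]
decoder_inputs = soft_copy(states, tgt_len=4, tau=0.5)
```

A small `tau` concentrates the weights on the nearest source position, while a large `tau` spreads them out; unlike the rounded index, the weights vary smoothly with `tau`.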

Learning from AT Experts
The conditional independence assumption prevents the NAT model from properly capturing the highly multimodal distribution of target translations. AT models take already generated target tokens as inputs, and can thus provide complementary information for NAT models. A straightforward idea to bridge the gap between NAT and AT is for NAT to actively learn the behavior of AT step by step. The AT demonstrator generates the action distribution $\pi_{AT}(O) \in \mathbb{R}^n$ as the posterior supervision signal. We expect this supervision to guide the generation process of NAT. imitate-NAT follows exactly the same decoder structure as our AT demonstrator, and emits a distribution $\pi_{NAT}(O) \in \mathbb{R}^n$ to learn from the AT demonstrator step by step. More specifically, we minimize the cross entropy between the distributions of the two policies:
$$\mathcal{L}_{IL} = H\big(\pi_{AT}(a \mid o_t), \pi_{NAT}(a \mid o_t)\big) = -\mathbb{E}_{\pi_{AT}(a \mid o_t)} \log \pi_{NAT}(a \mid o_t).$$
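A minimal sketch of the imitation term is given below: the cross entropy between the demonstrator's and the learner's action distributions, summed over decoding states. The function name and the toy distributions are assumptions; in the framework the distributions would come from the frozen AT demonstrator and the NAT learner.

```python
import math


def imitation_loss(pi_at, pi_nat):
    """Sum over decoding states of -sum_a pi_AT(a) * log pi_NAT(a)."""
    return -sum(
        p * math.log(q)
        for dist_at, dist_nat in zip(pi_at, pi_nat)
        for p, q in zip(dist_at, dist_nat)
        if p > 0
    )


pi_at = [[0.8, 0.2], [0.1, 0.9]]  # demonstrator distributions at two states
matched = imitation_loss(pi_at, pi_at)
mismatched = imitation_loss(pi_at, [[0.5, 0.5], [0.5, 0.5]])
```

The loss is smallest when the learner reproduces the demonstrator's distribution exactly, and grows as the two diverge.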

Training
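In the training process, the action distribution regularization term described in section 3.2.1, denoted $\mathcal{L}_\pi$, is combined with the commonly used cross-entropy loss $\mathcal{L}_{AT}$ in Eq. 1:
$$\mathcal{L}^{*}_{AT} = \mathcal{L}_{AT} + \lambda_1 \mathcal{L}_\pi.$$
For NAT models, the imitation learning term, denoted $\mathcal{L}_{IL}$, is combined with the commonly used cross-entropy loss $\mathcal{L}_{NAT}$ in Eq. 2:
$$\mathcal{L}^{*}_{NAT} = \mathcal{L}_{NAT} + \lambda_2 \mathcal{L}_{IL},$$
where $\lambda_1$ and $\lambda_2$ are hyper-parameters, which are set to 0.001 in our experiments.
Table 1: Comparison with the competitor methods of Gu et al. (2017), Lee et al. (2018) and Kaiser et al. (2018). imitate-NAT is our proposed NAT with imitation learning.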

Experiments
We evaluate our proposed model on machine translation tasks and provide the analysis. We present the experimental details in the following, including the introduction to the datasets as well as our experimental settings.

Knowledge Distillation Datasets
Sequence-level knowledge distillation is applied to alleviate multimodality in the training dataset, using the AT demonstrator as the teacher (Kim and Rush, 2016). We replace the reference target sentence of each training pair (X, Y) with a new target sentence Y*, which is generated by the teacher model (the AT demonstrator). Then we use the new dataset (X, Y*) to train our NAT model. To avoid the redundancy of running fixed teacher models repeatedly on the same data, we decode the entire training set once using each teacher to create a new training dataset for its respective student.
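The dataset construction can be sketched as a single pass over the training pairs. `teacher_translate` is a hypothetical callable standing in for beam-search decoding with the trained AT demonstrator, and `toy_teacher` exists only to make the sketch runnable.

```python
def distill(pairs, teacher_translate):
    """Replace each reference Y with the teacher output Y*, decoding each source once."""
    return [(source, teacher_translate(source)) for source, _reference in pairs]


def toy_teacher(source):
    """Stand-in teacher (an assumption): upper-cases the source sentence."""
    return source.upper()


train = [("ein haus", "a house"), ("ein hund", "a dog")]
distilled = distill(train, toy_teacher)
```

The distilled pairs are stored and reused, so the teacher is never run again during student training.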
Model Settings We first train the AT demonstrator and then freeze its parameters during the training of imitate-NAT. To speed up the convergence of NAT training, we also initialize imitate-NAT with the corresponding parameters of the AT expert, as they have similar architectures. For WMT14 En-De and WMT16 En-Ro, we use the hyperparameter settings of the base Transformer model in Vaswani et al. (2017) ($d_{model} = 512$, $d_{hidden} = 512$, $n_{layer} = 6$ and $n_{head} = 8$). As in Gu et al. (2017) and Lee et al. (2018), we use the small model ($d_{model} = 278$, $d_{hidden} = 507$, $n_{layer} = 5$ and $n_{head} = 2$) for IWSLT16 En-De. For sequence-level distillation, we set the beam size to 4. For imitate-NAT, we set the number of action categories to 512 and found imitate-NAT to be robust to this setting in our preliminary experiments.
Length Parallel Decoding For inference, we follow the common practice of noisy parallel decoding (Gu et al., 2017), which generates a number of decoding candidates in parallel and selects the best translation via re-scoring with the AT teacher. In our scenario, we first train a module to predict the target length as $\hat{T}$. However, due to the inherent uncertainty of the data itself, it is hard to predict the target length accurately. A reasonable solution is to generate multiple translation candidates by predicting different target lengths in $[\hat{T} - \Delta T, \hat{T} + \Delta T]$, which we call LPD (length parallel decoding). The model generates several outputs in parallel, and we then use the pre-trained autoregressive model to identify the best overall translation.
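The procedure can be sketched as follows. `nat_decode` and `at_score` are hypothetical callables standing in for the NAT model and the AT re-scorer, and the toy versions below exist only to make the sketch runnable. The candidates are independent of each other, so a real implementation could decode them as one batch.

```python
def length_parallel_decode(source, predicted_len, delta, nat_decode, at_score):
    """Decode one candidate per length in the window, keep the best under the AT scorer."""
    lengths = [n for n in range(predicted_len - delta, predicted_len + delta + 1) if n > 0]
    candidates = [nat_decode(source, n) for n in lengths]  # independent of each other
    return max(candidates, key=lambda cand: at_score(source, cand))


def toy_nat_decode(source, n):
    """Stand-in NAT model (an assumption): emits n placeholder tokens."""
    return ["tok"] * n


def toy_at_score(source, candidate):
    """Stand-in AT scorer (an assumption): prefers candidates as long as the source."""
    return -abs(len(candidate) - len(source))


best = length_parallel_decode(
    ["a", "b", "c"], predicted_len=4, delta=2,
    nat_decode=toy_nat_decode, at_score=toy_at_score,
)
```

In this toy run the predicted length is off by one, and re-scoring over the window recovers the better length.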

Results and Analysis
Competitor We include three NAT works as our competitors, the NAT with fertility (NAT-FT) (Gu et al., 2017), the NAT with iterative refinement (NAT-IR) (Lee et al., 2018) and the NAT with discrete latent variables (Kaiser et al., 2018). For all our tasks, we obtain the baseline performance by either directly using the performance figures reported in the previous works if they are available or producing them by using the open source implementation of baseline algorithms on our datasets. The results are shown in Table 1.
1. imitate-NAT significantly improves translation quality by a large margin. On all the benchmark datasets, our imitate-NAT with LPD achieves the best translation performance, which is even close to the results of the autoregressive model, e.g. 30.68 vs. 30.85 on the IWSLT16 En→De task and 31.81 vs. 32.59 on the WMT16 Ro→En task. It is also worth mentioning that introducing the imitation module into the AT demonstrator affects neither the performance nor the inference speed compared with the standard Transformer model.
2. Imitation learning plays an important role in bridging the gap between imitate-NAT and the AT demonstrator. Clearly, imitate-NAT leads to remarkable improvements over the competitor without the imitation module (almost 3 BLEU points on average). To make a fair comparison, the competitor follows exactly the same training steps as imitate-NAT, including the initialization, knowledge distillation and Soft-Copy. The only difference comes from the imitation module.
3. imitate-NAT gets better latency. For NAT-FT, a large sample size (10 or 100) is required to get satisfactory results, which seriously affects the inference speed of the model. For both NAT-FT and NAT-IR, the efficiency of the models with refinement techniques drops dramatically (15.6× → 2.36× for NAT-FT and 8.9× → 1.5× for NAT-IR). Our imitate-NAT gets even better performance with faster speed: the speedup over the AT model is 9.7×.

Ablation Study
To further study the effects brought by different techniques, we show in Table 2 the translation performance of different NAT model variants for the IWSLT16 En-De translation task.
Soft-Copy vs. Uniform-Copy The experimental results show that Soft-Copy is better than Uniform-Copy. Uniform-Copy employs a hard copy mechanism and directly copies the source embeddings without considering global information, which increases the learning burden of the decoder. Our model takes the output of the encoder as input and uses a differentiable copy mechanism, which gets much better results (25.34 vs. 20.71, see lines 3 and 2).
Imitation Learning vs. Non-Imitation Learning The imitation learning method leads to an improvement of around 3 BLEU points (28.41 vs. 25.34, see lines 6 and 3). NAT without IL degenerates into a normal NAT model. As discussed in section 1, current NAT approaches suffer from delayed supervision (or rewards) and a large search space in training: the NAT decoder generates all words of the translation simultaneously, so its search space is very large.
Length Parallel Decoding Compared with greedy beam search, the LPD technique improves the performance by around 2 BLEU points (30.68 vs. 28.41, see lines 7 and 6). This observation is consistent with our intuition that sampling from the length space can improve the performance.

Complementary with Knowledge Distillation
Consistent with previous work, NAT models gain +4.2 BLEU from the sequence-level knowledge distillation technique (see rows 1 and 2). imitate-NAT without knowledge distillation obtains a BLEU score of 23.56, which is comparable to non-imitation NAT with knowledge distillation (see rows 3 and 4). More importantly, we found that the imitation learning framework complements knowledge distillation well. As shown in rows 3 and 6, imitate-NAT improves on non-imitation NAT with knowledge distillation by +3.3 BLEU.
Figure 4: The redistribution method leads to a more balanced distribution (blue); without it, the distribution is extremely unbalanced (red).
Action Distribution Study One common problem in unsupervised clustering is that the results are unbalanced. In this paper, we say that an action is selected or activated when its probability in $\pi(o_t)$ is the maximum. The space usage can then be calculated by counting the number of times each action is selected. We evaluate the space usage on the development set of IWSLT16, and the results are presented in Figure 4. The category redistribution technique (Eq. 7, Eq. 8) greatly alleviates the space usage problem. When the model is built without category redistribution, most of the space is not utilized and the clustering results are concentrated in a few locations, so the category information cannot be characterized dynamically and flexibly. In contrast, category redistribution makes the category distribution more balanced and more in line with the inherent rules of the language, so the clustering results can effectively guide the learning of the NAT model.
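The space usage measurement described above can be sketched as a count of arg max selections. The function name, the returned utilization fraction and the toy distributions are assumptions for illustration.

```python
from collections import Counter


def space_usage(policies, n_actions):
    """Count how often each action has the maximum probability across decoding states."""
    counts = Counter(max(range(n_actions), key=p.__getitem__) for p in policies)
    fraction_used = len(counts) / n_actions
    return counts, fraction_used


# Three decoding states over 3 action categories (toy values).
policies = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
counts, fraction = space_usage(policies, 3)
```

Here category 1 is never selected, so only two thirds of the action space is in use; a heavily skewed `counts` is the unbalanced case the redistribution is meant to avoid.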

Related Work
Gu et al. (2017) first developed a non-autoregressive NMT system which produces the outputs in parallel, significantly boosting the inference speed. However, this comes at the cost of a large sacrifice in translation quality, since the intrinsic dependency within the natural language sentence is abandoned. A body of work has been proposed to mitigate such performance degradation. Lee et al. (2018) proposed a method of iterative refinement based on a latent variable model and a denoising autoencoder. Libovický and Helcl (2018) treat NAT as a connectionist temporal classification problem, which achieves better latency. Kaiser et al. (2018) use discrete latent variables that make decoding much more parallelizable. They first auto-encode the target sequence into a shorter sequence of discrete latent variables, which at inference time is generated autoregressively, and finally decode the output sequence from the shorter latent sequence in parallel. Guo et al. (2018) enhanced the decoder input by introducing a phrase table from SMT and an embedding transformation. Wang et al. (2019) leverage the dual nature of translation tasks (e.g., English to German and German to English) and minimize a backward reconstruction error to ensure that the hidden states of the NAT decoder are able to recover the source-side sentence.
Unlike previous work that modifies the NAT architecture or the decoder inputs, we introduce an imitation learning framework to close the performance gap between NAT and AT. To the best of our knowledge, this is the first time that imitation learning has been applied to such problems.

Conclusion
We propose an imitation learning framework for non-autoregressive neural machine translation to bridge the performance gap between NAT and AT. Specifically, we employ a knowledgeable AT demonstrator to supervise every decoding state of NAT across different time steps and layers. As a result, imitate-NAT leads to remarkable improvements and largely closes the performance gap between NAT and AT on several benchmark datasets.
In future work, we plan to improve the performance of NAT by introducing more powerful demonstrators with different structures (e.g. right-to-left). Another direction is to apply the proposed imitation learning framework to similar scenarios such as simultaneous interpretation.