A Study of Non-autoregressive Model for Sequence Generation

Non-autoregressive (NAR) models generate all the tokens of a sequence in parallel, resulting in faster generation speed compared to their autoregressive (AR) counterparts but at the cost of lower accuracy. Different techniques including knowledge distillation and source-target alignment have been proposed to bridge the gap between AR and NAR models in various tasks such as neural machine translation (NMT), automatic speech recognition (ASR), and text to speech (TTS). With the help of those techniques, NAR models can catch up with the accuracy of AR models in some tasks but not in others. In this work, we conduct a study to understand the difficulty of NAR sequence generation and try to answer: (1) Why can NAR models catch up with AR models in some tasks but not all? (2) Why do techniques like knowledge distillation and source-target alignment help NAR models? Since the main difference between AR and NAR models is that NAR models do not use dependency among target tokens while AR models do, intuitively the difficulty of NAR sequence generation heavily depends on the strength of dependency among target tokens. To quantify such dependency, we propose an analysis model called CoMMA to characterize the difficulty of different NAR sequence generation tasks. We have several interesting findings: 1) Among the NMT, ASR and TTS tasks, ASR has the most target-token dependency while TTS has the least. 2) Knowledge distillation reduces the target-token dependency in the target sequence and thus improves the accuracy of NAR models. 3) Source-target alignment constraint encourages dependency of a target token on source tokens and thus eases the training of NAR models.


Introduction
Non-autoregressive (NAR) models (Oord et al., 2017; Gu et al., 2017), which generate all the tokens in a target sequence in parallel and can speed up inference, are widely explored in natural language and speech processing tasks such as neural machine translation (NMT) (Gu et al., 2017; Guo et al., 2019a; Li et al., 2019b; Guo et al., 2019b), automatic speech recognition (ASR) and text to speech (TTS) synthesis (Oord et al., 2017). However, NAR models usually lead to lower accuracy than their autoregressive (AR) counterparts since the inner dependencies among the target tokens are explicitly removed.
Several techniques have been proposed to alleviate the accuracy degradation, including 1) knowledge distillation (Oord et al., 2017; Gu et al., 2017; Guo et al., 2019a,b), and 2) imposing a source-target alignment constraint with fertility (Gu et al., 2017), word mapping (Guo et al., 2019a), attention distillation (Li et al., 2019b) and duration prediction. With the help of these techniques, it is observed that NAR models can match the accuracy of AR models for some tasks, but the gap still exists for other tasks (Gu et al., 2017). Therefore, several questions arise naturally: (1) Why does the gap still exist for some tasks? Are some tasks more difficult for NAR generation than others? (2) Why do techniques like knowledge distillation and source-target alignment help NAR generation?
The main difference between AR and NAR models is that NAR models do not consider the dependency among target tokens, which is also the root cause of the accuracy drop of NAR models. Thus, to better understand NAR sequence generation and answer the above questions, we need to characterize and quantify the target-token dependency, which turns out to be non-trivial since the sequences can be of different modalities (i.e., speech or text). For this purpose, we design a novel model called the COnditional Masked prediction model with Mix-Attention (CoMMA), inspired by the mix-attention in He et al. (2018) and the masked language modeling in Devlin et al. (2018): in CoMMA, (1) the prediction of one target token can attend to all the source and target tokens with mix-attention, and (2) target tokens are randomly masked with varying probabilities. CoMMA can help us measure target-token dependency using the ratio of the attention weights on the target context over those on the full (both source and target) context when predicting a target token: the bigger the ratio, the larger the dependency among target tokens.
We conduct a comprehensive study in this work and obtain several interesting discoveries that can answer the previous questions. First, we find that the rank of the target-token dependency among the three tasks is ASR>NMT>TTS: ASR has the largest dependency while TTS has the smallest. This finding is consistent with the accuracy gap between AR and NAR models and demonstrates the difficulty of NAR generation across tasks. Second, we replace the target sequences of the original training data with the sequences generated by an AR model (i.e., through knowledge distillation) and use the new data to train CoMMA; we find that the target-token dependency is reduced. A smaller target-token dependency makes NAR training easier and thus improves accuracy. Third, source-target alignment constraints such as explicit duration prediction or implicit attention distillation (Li et al., 2019b) also reduce the target-token dependency, thus helping the training of NAR models.
The main contributions of this work are as follows:
• We design a novel model, the conditional masked prediction model with mix-attention (CoMMA), to measure the token dependency for sequence generation.
• With CoMMA, we find that: 1) among the three tasks, ASR is the most difficult and TTS is the easiest for NAR generation; 2) both knowledge distillation and imposing a source-target alignment constraint reduce the target-token dependency, and thus reduce the difficulty of training NAR models.

CoMMA
In this section, we analyze the token dependency in the target sequence with a novel conditional masked prediction model with mix-attention (CoMMA). We first introduce the design and structure of CoMMA, and then describe how to measure the target token dependency based on CoMMA.

The Design of CoMMA
It is non-trivial to directly measure and compare the target token dependency across different target modalities (i.e., speech or text) and different conditional source modalities (i.e., speech or text). Therefore, we have several considerations in the design of CoMMA: 1) We use masked language modeling as in BERT (Devlin et al., 2018), conditioned on the source, to train CoMMA, which helps measure the dependency on target context when predicting the current masked token. 2) In order to make the dependency on source and target tokens comparable, we use mix-attention (He et al., 2018) to calculate the attention weights on both source and target tokens in a single softmax function. The model architecture of CoMMA is shown in Figure 1. Specifically, CoMMA differs from the standard Transformer (Vaswani et al., 2017) as follows: 1) Some tokens are randomly replaced by a special mask token M with probability p, and the model is trained to predict the original unmasked tokens. 2) We employ the mix-attention mechanism (He et al., 2018), where layer i of the decoder can attend to itself and to layer i of the encoder at the same time, computing the attention weights in a single softmax function. We share the parameters of the attention and feed-forward layers between the encoder and decoder. 3) Following He et al. (2018), we add a source/target embedding to tell the model whether a token is from the source or the target sequence, and also add position embeddings, with the positions of source and target tokens both starting from zero. 4) The encoder and decoder pre-nets (Shen et al., 2018) vary across tasks: for ASR, the encoder pre-net consists of a 3-layer 2D convolutional network with ReLU activation, and the decoder pre-net consists of only an embedding lookup table; for NMT, both the encoder and decoder pre-nets consist of only an embedding lookup table.
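The key property of mix-attention is that a single softmax normalizes the scores over source and target positions together, so the resulting weights on the two sides are directly comparable. A minimal sketch (the function name, toy dimensions and dot-product scoring are illustrative, not the paper's exact implementation):

```python
import math

def mix_attention_weights(query, src_keys, tgt_keys):
    """Toy sketch of mix-attention: one softmax over the concatenated
    target and source keys, so the attention mass on the target side
    and on the source side live on the same scale."""
    keys = tgt_keys + src_keys  # target positions first, then source
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]  # sums to 1 over source AND target

# One query, one target key, two source keys: three comparable weights.
w = mix_attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]])
assert len(w) == 3 and abs(sum(w) - 1.0) < 1e-9
```

Because all positions share one normalization, summing the first N entries of a row directly gives the fraction of attention spent on the target context, which is what the analysis below exploits.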
CoMMA is designed to measure the target token dependency in a variety of sequence generation settings, including AR (unidirectional) generation, NAR generation, bidirectional generation, and even identity copy. To this end, we vary the mask probability p (the ratio of masked tokens among all target tokens) in a uniform distribution p ∼ U(0.0, 1.0) when training CoMMA. In this way, p = 1 covers NAR generation, p = 0 covers identity copy, and in some cases, p can also cover AR generation.
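The masking scheme above can be sketched as follows; `mask_targets` is a hypothetical helper (the paper does not publish this code), and in training p itself would be drawn from U(0, 1) per example:

```python
import random

def mask_targets(tokens, p, mask_token="<M>", rng=None):
    """Mask a ratio p of the target tokens at uniformly random
    positions. p = 1.0 masks everything (the NAR setting); p = 0.0
    masks nothing (identity copy)."""
    rng = rng or random.Random(0)
    k = round(p * len(tokens))  # number of tokens to mask
    positions = sorted(rng.sample(range(len(tokens)), k))
    masked = list(tokens)
    for i in positions:
        masked[i] = mask_token
    return masked, positions

masked, pos = mask_targets(["a", "b", "c", "d"], p=1.0)
assert masked == ["<M>", "<M>", "<M>", "<M>"]
```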

How to Measure Target Token Dependency based on CoMMA
To measure the target token dependency, we define a metric called the attention density ratio R, which represents the ratio of the attention density (the normalized attention weights) on the target context in mix-attention when predicting a target token with a well-trained CoMMA. We describe the calculation of R in the following steps. First, we define the attention density ratio $\alpha_i$ for a single target token $i$ as

$$\alpha_i = \sum_{j=1}^{N} A_{i,j},$$

where $A_{i,j}$ denotes the attention weight from token $i$ to position $j$ in mix-attention, $j \in [1, N]$ indexes the target tokens and $j \in [N+1, N+M]$ indexes the source tokens, $M$ and $N$ are the lengths of the source and target sequences respectively, and $\sum_{j=1}^{N+M} A_{i,j} = 1$. Thus $\alpha_i$ represents the ratio of attention density on the target context when predicting target token $i$.
Second, we average the attention density ratio $\alpha_i$ over all the predicted tokens (masked with probability $p$) in a sentence:

$$R'(p) = \frac{1}{|\mathcal{M}_p|} \sum_{i \in \mathcal{M}_p} \alpha_i,$$

where $\mathcal{M}_p$ represents the set of masked target tokens under mask probability $p$ and $|\mathcal{M}_p|$ denotes the number of tokens in the set. Third, for a given $p$, we calculate $R'(p)$ over all test data and average them to get the final attention density ratio $R(p)$. We vary $p$ and calculate $R(p)$ to measure the density ratio under different conditions, where a small $p$ means more target context can be leveraged and a large $p$ means less. In the extreme cases, $p = 1$ represents NAR generation while $p = 0$ represents identity copy. Given the proposed attention density ratio $R(p)$ based on CoMMA, we can measure the target token dependency of the NAR model in different tasks, which helps answer a series of important research questions, as we introduce in the following three sections.
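The per-sentence computation above can be sketched directly from the definitions (the function name and plain-list representation are illustrative):

```python
def attention_density_ratio(A, N, masked_positions):
    """Per-sentence attention density ratio from mix-attention weights.
    A[i][j] is the weight from target token i to position j, where
    columns 0..N-1 are target positions and columns N..N+M-1 are
    source positions (0-based); each row of A sums to 1."""
    alphas = []
    for i in masked_positions:
        row = A[i]
        assert abs(sum(row) - 1.0) < 1e-6  # softmax normalization
        alphas.append(sum(row[:N]))  # alpha_i: mass on the target context
    return sum(alphas) / len(alphas)  # average over masked tokens

# Two target tokens (N=2), one source token (M=1); both targets masked.
A = [[0.6, 0.2, 0.2],
     [0.1, 0.1, 0.8]]
r = attention_density_ratio(A, N=2, masked_positions=[0, 1])
assert abs(r - 0.5) < 1e-9  # (0.8 + 0.2) / 2
```

Averaging this quantity over all test sentences for a fixed p yields R(p).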

Study on the Difficulty of NAR Generation
In this section, we aim to find out why the gap still exists for the ASR and NMT tasks, while in TTS, NAR models can catch up with the accuracy of AR models. We also analyze the causes of the different difficulties across tasks. We start by evaluating the accuracy gap between AR and NAR models for NMT, ASR and TTS, and then measure the token dependency based on our proposed CoMMA.

The Accuracy Gap
We first train the AR and NAR models in each task and check the accuracy gap between AR and NAR models to measure the difficulty of NAR generation in each task.

Configuration of AR and NAR Model
The AR and NAR models we consider are shown in Table 1, where we use Transformer as the AR model and a representative NAR model for each task. For a fair comparison, we make some modifications to the NAR models: 1) For ASR, we first train a Transformer ASR model as the teacher and then constrain the attention distributions of NAR-ASR with the alignments converted from the teacher attention weights, which will be introduced and discussed in Section 5. 2) For NMT, we constrain the KL-divergence of the encoder-to-decoder attention distributions between the AR and NAR models following Li et al. (2019b). We also list the hyperparameters of the AR and NAR models for each task in Section A.

For speech data, we transform the raw audio into mel-spectrograms following Shen et al. (2018) with 50 ms frame size and 12.5 ms hop size. For text data, we tokenize sentences with the Moses tokenizer and then segment them into subword symbols using Byte Pair Encoding (BPE) (Sennrich et al., 2015) for subword-level analysis, and convert the text sequence into a phoneme sequence with grapheme-to-phoneme conversion (Sun et al., 2019) for phoneme-level analysis. We use BPE for NMT and ASR, and phonemes for TTS, by default unless otherwise stated. We train all models on 2 NVIDIA 2080Ti GPUs using the Adam optimizer with β1 = 0.9, β2 = 0.98, ε = 10−9, following the same learning rate schedule as Vaswani et al. (2017).

For ASR, we evaluate the word error rate (WER) on the test-clean set of the LibriTTS dataset. For NMT, we evaluate the BLEU score on the IWSLT 2014 De-En test set. For TTS, we randomly split the LJSpeech dataset into 3 sets: 12500 samples for training, 300 samples for validation and 300 samples for testing, and then evaluate the mean opinion score (MOS) on the test set to measure audio quality. The output mel-spectrograms of the TTS model are transformed into audio samples using the pretrained WaveGlow (Prenger et al., 2019). Each audio sample is listened to by at least 20 testers, who are all native English speakers.
Results of Accuracy Gap The accuracies of the AR and NAR models in each task are shown in Table 2.

The Token Dependency
In the last subsection, we analyzed the difficulty of NAR models from the perspective of the accuracy gap. In this subsection, we look for corroborating evidence in the target token dependency, which should be consistent with the accuracy gap as a measure of task difficulty.

Configuration of CoMMA
We train CoMMA with the same configuration on NMT, ASR and TTS: the hidden size, feed-forward hidden size and number of layers are set to 512, 1024 and 6, respectively. We list the other hyperparameters of CoMMA in Section B. We also use the same datasets for each task as described in Section 3.1 to train CoMMA.

Results of Token Dependency
We use the attention density ratio calculated from CoMMA (as described in Section 2.2) to measure the target token dependency and show the results in Figure 2. It can be seen that the rank of attention density ratio R(p) is ASR>NMT>TTS for all p. Considering that R(p) measures how much context information from target side is needed to generate a target token, we can see that ASR has more dependency on the target context and less on the source context, while TTS is the opposite, which is consistent with the accuracy gap between AR and NAR models as we described in Section 3.1.
As we vary p from 0.1 to 0.5, R(p) decreases for all tasks since more tokens on the target side are masked. We also find that R(p) in NMT decreases more quickly than in the other two tasks, which indicates that NMT is good at learning from the source context when less context information can be leveraged from the target side, while R(p) in ASR decreases little. This can also explain why NAR achieves a smaller gap in NMT than in ASR.

Study on Knowledge Distillation
In this section and the next, we investigate why certain techniques can help NAR generation from the perspective of target token dependency. We only analyze knowledge distillation and attention alignment techniques, which are widely used in NAR models, but we believe our analysis method can be applied to other NAR techniques, such as iterative refinement, fine-tuning from an AR model (Guo et al., 2019b), and so on.
Most existing NAR models (Oord et al., 2017; Gu et al., 2017; Guo et al., 2019a,b) rely on the technique of knowledge distillation, which generates a new target sequence from a pre-trained AR model given the original source sequence, and trains the NAR model on it for better accuracy. In this section, we first conduct experiments to verify the accuracy improvements of knowledge distillation. Next, based on our proposed CoMMA, we analyze why knowledge distillation helps NAR models.

The Effectiveness of Knowledge Distillation
Knowledge Distillation for NAR Models Given a well-trained AR model $\theta_T$ and a source sequence $x \in \mathcal{X}$ from the original training data, a new target sequence can be generated through

$$y' = \arg\max_{y} P(y \mid x; \theta_T).$$

We use beam search for NMT and ASR and greedy search for TTS to generate $y'$. Given the set of generated sequence pairs $(\mathcal{X}, \mathcal{Y}')$, we train the NAR models with the negative log-likelihood loss

$$\mathcal{L}(\theta) = -\sum_{(x, y') \in (\mathcal{X}, \mathcal{Y}')} \log P(y' \mid x; \theta),$$

where $\theta$ is the parameter set of the NAR model.
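The data-generation step above amounts to replacing each original target with the teacher's decoded output. A minimal sketch, where `teacher_decode` is a hypothetical stand-in for beam search (NMT/ASR) or greedy decoding (TTS) with a trained AR model:

```python
def distill_dataset(sources, teacher_decode):
    """Sequence-level knowledge distillation: pair each source with
    the teacher's output instead of the original ground truth. The
    NAR model is then trained on these (x, y') pairs."""
    return [(x, teacher_decode(x)) for x in sources]

# A toy "teacher" that uppercases its input stands in for an AR model.
pairs = distill_dataset(["hello", "world"], teacher_decode=str.upper)
assert pairs == [("hello", "HELLO"), ("world", "WORLD")]
```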

Experimental Results
We conduct knowledge distillation only on NMT and TTS since there are no previous works applying it to ASR yet. We train the NAR models in NMT and TTS with the raw target token sequences instead of the teacher outputs and compare the results with those in Table 2. The accuracy improvements of knowledge distillation are shown in Table 3. It can be seen that knowledge distillation boosts the accuracy of NAR models in NMT and TTS, which is consistent with previous works.

Why Knowledge Distillation Works
A recent study finds that knowledge distillation can reduce the complexity of data sets and help NAT models better capture the variations in the output data. While this explanation is reasonable, it is made mainly at the data level and is not intuitive. In this subsection, we analyze knowledge distillation from a more understandable and intuitive perspective, by observing the change of the token dependency based on our proposed CoMMA.
We measure the target token dependency by training CoMMA with the original training data and with the new data generated through knowledge distillation, respectively. The results are shown in Figure 3. It can be seen that knowledge distillation decreases the attention density ratio R(p) on both tasks, indicating that it reduces the dependency on the target-side context when predicting a target token, which is helpful for NAR model training.

Study on Alignment Constraint
Without the help of target context, NAR models usually suffer from ambiguous attention to the source context, which hurts accuracy. Recently, many works have proposed a variety of approaches to improve the source-target alignment of NAR models, which sharpens the estimation of the soft alignment in the attention mechanism. For example, Li et al. (2019b) constrain the KL-divergence of the encoder-to-decoder attention distributions between the AR and NAR models. Gu et al. (2017) predict the fertility of the source tokens to approximate the alignments between the target and source sequences. Guo et al. (2019a) convert source tokens to target tokens with a phrase table or embedding mapping for alignment. In TTS, prior work predicts the duration (the number of mel-spectrogram frames) of each phoneme.
In this section, we first study the effectiveness of alignment constraint for NAR models, and then analyze why alignment constraint can help the NAR models by observing the changes of token dependency based on our proposed CoMMA.

The Effectiveness of Alignment Constraint
Alignment Constraint for NAR Models For each task, we choose an alignment constraint mechanism that is commonly used in previous works.
For NMT, we follow Li et al. (2019b) to minimize the KL-divergence between the attention distributions of the AR and NAR models as follows:

$$\mathcal{L}_{ac} = \frac{1}{N} \sum_{i=1}^{N} D_{KL}\left(A'_i \,\|\, A_i\right), \qquad (6)$$

where $A'_i$ and $A_i$ denote the $i$-th rows of the source-target attention weights from the AR teacher model and the NAR student model respectively, $A', A \in \mathbb{R}^{N \times M}$, and $N$ and $M$ are the numbers of tokens in the target and source sequences. For TTS, we follow previous work to extract the encoder-to-decoder attention alignments from the well-trained AR teacher model, convert them to a phoneme duration sequence, and then train a duration predictor to expand the hidden states of the source sequence to match the length of the target sequence.
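The row-wise KL constraint between teacher and student attention can be sketched as below (the function name and epsilon smoothing are illustrative; the rows are assumed to already be probability distributions):

```python
import math

def attention_kl_loss(teacher_rows, student_rows, eps=1e-12):
    """Row-wise KL(teacher || student) between source-target attention
    distributions, averaged over the N target positions. eps guards
    against log(0) for sparse attention rows."""
    total = 0.0
    for t_row, s_row in zip(teacher_rows, student_rows):
        total += sum(t * (math.log(t + eps) - math.log(s + eps))
                     for t, s in zip(t_row, s_row))
    return total / len(teacher_rows)

# Identical distributions give zero loss; mismatched ones are penalized.
assert abs(attention_kl_loss([[0.7, 0.3]], [[0.7, 0.3]])) < 1e-6
assert attention_kl_loss([[0.9, 0.1]], [[0.5, 0.5]]) > 0.0
```

Minimizing this loss pulls the student's attention rows toward the teacher's, imposing the alignment without requiring target context.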
For ASR, since there is no previous work proposing an alignment constraint for NAR, we design a new alignment constraint method and explore its effectiveness. We first calculate the expected source position of the teacher's attention distribution for the $i$-th target token, $E_i = \sum_{j=1}^{M} j \cdot A'_{i,j}$, and round it to the nearest integer. Then we constrain the attention weights of the $i$-th target token of the NAR model so that it can only attend to the source positions between $E_{i-1}$ and $E_{i+1}$. Specifically, the first target token can only attend to the source positions between 1 and $E_2$, while the last target token can only attend to the positions between $E_{N-1}$ and $M$. We apply this alignment constraint for ASR only in the training stage.

Experimental Results We follow the model configurations and datasets described in Section 3.1, and explore the accuracy improvements when adding the attention constraint to NAR models. The results are shown in Table 4. It can be seen that the attention constraint not only improves the performance of NMT and TTS as previous works (Li et al., 2019b) demonstrated, but also helps the NAR-ASR model achieve better scores.
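The ASR constraint described above can be sketched as a hard attention mask built from the teacher's expected positions (the function name and the 0/1 mask representation are illustrative):

```python
def asr_attention_mask(teacher_A, N, M):
    """Build the hard attention mask for NAR-ASR: target token i may
    only attend to source positions between E_{i-1} and E_{i+1}, where
    E_i is the rounded expected source position under the teacher's
    attention row (positions are 1-based, as in the text)."""
    E = [round(sum(j * a for j, a in zip(range(1, M + 1), row)))
         for row in teacher_A]
    mask = []
    for i in range(N):
        lo = 1 if i == 0 else E[i - 1]   # first token: from position 1
        hi = M if i == N - 1 else E[i + 1]  # last token: up to M
        mask.append([1 if lo <= j <= hi else 0 for j in range(1, M + 1)])
    return mask

# A monotone (diagonal) teacher alignment yields a banded mask.
teacher_A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
assert asr_attention_mask(teacher_A, N=3, M=3) == [[1, 1, 0],
                                                   [1, 1, 1],
                                                   [0, 1, 1]]
```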

Why Alignment Constraint Works
We further analyze how alignment constraint could help on NAR models by measuring the changes of token dependency when adding alignment constraint on CoMMA.
For simplicity, we use the method described in Equation 6 to help the training of CoMMA, where the teacher model is the AR model and the student model is CoMMA. We minimize the KL-divergence between the per-head encoder-to-decoder attention distributions of the AR model and CoMMA. First, we normalize the encoder-to-decoder part of the attention weights in each head of mix-attention so that each row becomes a distribution over source positions:

$$\hat{A}_{i,j} = \frac{A_{i,N+j}}{\sum_{k=N+1}^{N+M} A_{i,k}}, \quad j \in [1, M],$$

where $A \in \mathbb{R}^{N \times (N+M)}$ is the weight matrix of mix-attention described in Section 2.2, $\hat{A} \in \mathbb{R}^{N \times M}$ is the normalized encoder-to-decoder attention weights, and $M$ and $N$ are the lengths of the source and target sequences. Then we compute the KL-divergence loss for each head:

$$\mathcal{L}_{ac} = \frac{1}{N} \sum_{i=1}^{N} D_{KL}\left(A'_i \,\|\, \hat{A}_i\right),$$

where $A' \in \mathbb{R}^{N \times M}$ is the encoder-to-decoder attention of the AR teacher model. We average $\mathcal{L}_{ac}$ over all heads and layers to get the final attention constraint loss for CoMMA. We measure the token dependency by calculating the attention density ratio R(p) based on CoMMA, and show the results in Figure 4. It can be seen that the alignment constraint helps reduce the ratio R(p) on each task and thus reduces the dependency on target context when predicting target tokens. Meanwhile, the alignment constraint helps the model extract more information from the source context, which aids the learning of NAR models.
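The normalization step, which carves the source columns out of a mix-attention row and renormalizes them into a distribution, can be sketched as follows (function name and plain-list layout are illustrative):

```python
def enc_dec_distribution(mix_row, N):
    """Renormalize the source part of one row of mix-attention
    (columns N..N+M-1 in 0-based indexing) into a probability
    distribution over the M source positions."""
    src = mix_row[N:]          # drop the N target columns
    z = sum(src)               # total mass on the source side
    return [a / z for a in src]

# N = 2 target columns, M = 2 source columns.
row = [0.2, 0.2, 0.3, 0.3]
d = enc_dec_distribution(row, N=2)
assert abs(sum(d) - 1.0) < 1e-9 and abs(d[0] - 0.5) < 1e-9
```

The resulting rows are then compared against the teacher's encoder-to-decoder attention with the same row-wise KL loss as in Equation 6.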
Another interesting finding is that the NAR model in TTS benefits most from the attention constraint, as shown in Table 4, while TTS also has the smallest attention density ratio, as shown in Figure 4. These observations suggest that NAR models with small target token dependency benefit most from an alignment constraint.

Related Works
Several works try to analyze and understand NAR models on different tasks. We discuss these analyses from two aspects: knowledge distillation and source-target alignment constraint.
Knowledge Distillation Knowledge distillation has long been used to compress model size (Hinton et al., 2015; Furlanello et al., 2018; Anil et al., 2018) or transfer the knowledge of a teacher model to a student model (Liu et al., 2019a,b), and was soon applied to NAR models (Gu et al., 2017; Oord et al., 2017; Guo et al., 2019a; Li et al., 2019b; Guo et al., 2019b) to boost accuracy. Some works focus on studying why knowledge distillation works: Phuong and Lampert (2019) provide insights into the mechanisms of knowledge distillation by studying the special case of linear and deep linear classifiers, finding that data geometry, optimization bias and strong monotonicity determine the success of distillation; Yuan et al. (2019) argue that the success of KD is also due to the regularization of soft targets, which might be as important as the similarity information between categories.
However, few works have studied why knowledge distillation benefits NAR training. A recent study investigates why knowledge distillation is important for training NAR models in the NMT task and finds that knowledge distillation can reduce the complexity of data sets and help the NAR model learn the variations in the output data. Li et al. (2019b) explore the causes of the poor performance of NAR models by observing the attention distributions and hidden states of the NAR model. Another work presents experiments and analysis to prove the necessity of multi-iteration generation for NAT, investigates the effectiveness of knowledge distillation in different tasks, and makes the assumption that the teacher model can essentially clean the training data so that the distilled NAR model substantially outperforms the NAR model trained with raw data.
Attention Alignment Constraint Previous works pointed out that adding additional alignment knowledge can improve the estimation of the soft alignment in the attention mechanism. For example, Chen et al. (2016) use the Viterbi alignments of IBM Model 4 as additional knowledge during NMT training by computing the divergence between the attention weights and the statistical alignment information.
Compared with AR models, the attention distributions of NAR models are more ambiguous, which leads to their poor performance. Recent works employ an attention alignment constraint between well-trained AR and NAR models to train a better NAR model. Li et al. (2019b) leverage intermediate hidden information from a well-trained AR-NMT teacher model to improve the NAR-NMT model by minimizing the KL-divergence between the per-head encoder-to-decoder attention of the teacher and the student. In TTS, prior work chooses an encoder-to-decoder attention head from the AR-TTS teacher as the attention alignment to improve the performance of the NAR model.

Conclusion
In this paper, we conducted a comprehensive study on NAR models in NMT, ASR and TTS tasks to analyze several research questions, including the difficulty of NAR generation and why knowledge distillation and alignment constraints help NAR models. We designed a novel model, CoMMA, and a metric called attention density ratio to measure the dependency on target context when predicting a target token, which allows us to analyze these questions in a unified way. Through a series of empirical studies, we demonstrated that the difficulty of NAR generation correlates with the target token dependency, and that knowledge distillation as well as alignment constraints reduce the dependency among target tokens and encourage the model to rely more on the source context for target token prediction, which improves the accuracy of NAR models. We believe our analyses can shed light on the understanding and further improvement of NAR models.