Non-Autoregressive Translation by Learning Target Categorical Codes

Non-autoregressive Transformer is a promising text generation model. However, current non-autoregressive models still fall behind their autoregressive counterparts in translation quality. We attribute this accuracy gap to the lack of dependency modeling among decoder inputs. In this paper, we propose CNAT, which implicitly learns categorical codes as latent variables for non-autoregressive decoding. The interaction among these categorical codes remedies the missing dependencies and improves the model capacity. Experimental results show that our model achieves comparable or better performance in machine translation tasks than several strong baselines.


Introduction
Non-autoregressive Transformer (NAT; Gu et al., 2018; Lee et al., 2018; Ghazvininejad et al., 2019) is a promising text generation model for machine translation. It introduces a conditional independence assumption among the target language outputs and generates the whole sentence simultaneously, bringing a remarkable efficiency improvement (more than 10× speed-up) over the autoregressive model. However, NAT models still lag behind autoregressive models in terms of BLEU (Papineni et al., 2002) for machine translation. We attribute the low quality of NAT models to the lack of dependency modeling for the target outputs, which makes it harder to model the generation of the target-side translation.
A promising approach is to model the dependencies of the target language with latent variables. A line of research (Kaiser et al., 2018; Shu et al., 2019; Ma et al., 2019) introduces latent variable modeling to the non-autoregressive Transformer and improves translation quality. The latent variables can be regarded as a springboard that bridges the modeling gap, providing more informative decoder inputs than the previously used copied inputs. More specifically, the latent-variable-based model first predicts a latent variable sequence conditioned on the source representation, where each variable represents a chunk of words. The model can then generate all the target tokens simultaneously, conditioned on the latent sequence and the source representation, since the target dependencies have been modeled in the latent sequence.
However, due to the modeling complexity of the chunks, the above approaches rely on a large number (more than $2^{15}$; Kaiser et al., 2018) of latent codes for discrete latent spaces, which may hurt translation efficiency, the essential goal of non-autoregressive decoding. Akoury et al. (2019) introduce syntactic labels as a proxy for the learned discrete latent space and improve the NATs' performance. The syntactic labels greatly reduce the search space of latent codes, leading to better performance in both quality and speed. However, this approach needs an external syntactic parser to produce the reference syntactic tree, which may only be effective in limited scenarios. Thus, it is still challenging to efficiently model the dependencies between latent variables for non-autoregressive decoding.
In this paper, we propose to learn a set of latent codes that act like syntactic labels but are learned without explicit syntactic trees. To learn these codes in an unsupervised way, we use each latent code to represent a fuzzy target category instead of a chunk as in previous research (Akoury et al., 2019). More specifically, we first employ vector quantization to discretize the target language into a latent space with a smaller number (less than 128) of latent variables, which serve as fuzzy word-class information for each target language word. We then model the latent variables with conditional random fields (CRF; Lafferty et al., 2001; Sun et al., 2019). To avoid the mismatch between training and inference in latent variable modeling, we propose using a gated neural network to form the decoder inputs. Equipped with scheduled sampling (Bengio et al., 2015), the model works more robustly.
Experimental results on WMT14 and IWSLT14 show that CNAT achieves new state-of-the-art performance without knowledge distillation. With sequence-level knowledge distillation and reranking techniques, CNAT is comparable to the current state-of-the-art iterative-based model while keeping a competitive decoding speed-up.

Background
Neural machine translation (NMT) is formulated as a conditional probability model $p(y \mid x)$, which models a sentence $y = \{y_1, y_2, \cdots, y_m\}$ in the target language given the input $x = \{x_1, x_2, \cdots, x_n\}$ from the source language.

Non-Autoregressive Neural Machine Translation
Gu et al. (2018) propose the Non-Autoregressive Transformer (NAT) for machine translation, which breaks the dependency among target tokens and thus achieves simultaneous decoding for all tokens. For a source sentence, a non-autoregressive decoder factorizes the probability of its target sentence as:
$$p(y \mid x; \theta) = \prod_{t=1}^{m} p(y_t \mid x; \theta),$$
where $\theta$ is the set of model parameters.
NAT has a similar architecture to the autoregressive Transformer (AT; Vaswani et al., 2017), which consists of a multi-head attention based encoder and decoder. The model first encodes the source sentence $x_{1:n}$ as the contextual representation $e_{1:n}$, then employs an extra module to predict the target length and form the decoder inputs.
• Length Prediction: Specifically, the length predictor in the bridge module predicts the target sequence length $m$ by:
$$m = n + \Delta L, \qquad \Delta L = \arg\max_{\Delta L} \; p(\Delta L \mid e_{1:n}; \phi),$$
where $\Delta L$ is the length difference between the target and source sentence and $\phi$ is the parameter of the length predictor.
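• Inputs Initialization: With the target sequence length $m$, the decoder inputs $h = h_{1:m}$ are formed as a softmax-weighted copy of the source representation $e_{1:n}$, where $\tau$ is a hyper-parameter to control the sharpness of the softmax function.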
With the computed decoder inputs $h$, NAT generates target sequences simultaneously by $\arg\max_{y_t} p(y_t \mid x; \theta)$ for each timestep $t$, effectively reducing the computational overhead in decoding (see Figure 1b). Though NAT achieves around 10× speed-up in machine translation over autoregressive models, it still suffers from potential performance degradation (Gu et al., 2018). The results degrade because the removal of target dependencies prevents the decoder from leveraging the inherent sentence structure in prediction. Moreover, taking the copied source representation as decoder inputs implicitly assumes that the source and target languages share a similar order, which may not always be the case (Bao et al., 2019).
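To make the idea of copied decoder inputs concrete, the following is a minimal sketch of a softmax-weighted copy from $n$ source states to $m$ decoder inputs. The distance-based weighting $-|j - i| / \tau$ and the function name `soft_copy` are assumptions for illustration; the exact weighting used by the model is not restated here.

```python
import math


def soft_copy(enc, m, tau=0.3):
    """Form m decoder inputs as softmax-weighted averages of n encoder states.

    Assumption: the weight of source position i for target position j
    decays with the distance |j - i|, sharpened by the temperature tau.
    """
    n, dim = len(enc), len(enc[0])
    inputs = []
    for j in range(m):
        logits = [-abs(j - i) / tau for i in range(n)]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        inputs.append(
            [sum(weights[i] * enc[i][k] for i in range(n)) for k in range(dim)]
        )
    return inputs


# Three source states copied to four decoder positions.
enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = soft_copy(enc, m=4, tau=0.1)
```

With a small `tau`, each decoder input is close to a single source state; a larger `tau` blends neighboring states more evenly.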

Latent Transformer
To bridge the gap between non-autoregressive and autoregressive decoding, Kaiser et al. (2018) introduce the Latent Transformer (LT). It incorporates non-autoregressive decoding with conditional dependency as the latent variable to alleviate the degradation resulting from the absence of dependency:
$$p(y \mid x) = \sum_{z} p(z \mid x; \phi) \prod_{t=1}^{m} p(y_t \mid z, x; \theta),$$
where $z = \{z_1, \cdots, z_L\}$ is the latent variable sequence, $L$ is the length of the latent sequence, and $\phi$ and $\theta$ are the parameters of the latent predictor and the translation model, respectively.
The LT architecture stays unchanged from the original NAT models, except for the latent predictor and decoder inputs. During inference, the Latent Transformer first autoregressively predicts the latent variables $z$, then non-autoregressively produces the entire target sentence $y$ conditioned on the latent sequence $z$ (see Figure 1c). Ma et al. (2019) and Shu et al. (2019) extend this idea by modeling $z$ as continuous latent variables and replacing the autoregressive predictor with an iterative transformation layer, achieving promising results.

Approach
In this section, we present our proposed CNAT, an extension to the Transformer that incorporates non-autoregressive decoding for target tokens and autoregressive decoding for latent sequences.
In brief, CNAT follows the architecture of Latent Transformer (Kaiser et al., 2018), except for the latent variable modeling (in § 3.1 and § 3.2) and inputs initialization (in § 3.3).

Modeling Target Categorical Information by Vector Quantization
Categorical information has achieved great success in neural machine translation, such as part-of-speech (POS) tags in autoregressive translation (Yang et al., 2019) and syntactic labels in non-autoregressive translation (Akoury et al., 2019). Inspired by the broad application of categorical information, we propose to model the implicit categorical information of target words in a non-autoregressive Transformer. Each target sequence $y = y_{1:m}$ is assigned a discrete latent variable sequence $z = z_{1:m}$. We assume that each $z_i$ captures the fuzzy category of its token $y_i$. Then, the conditional probability $p(y \mid x)$ is factorized with respect to the categorical latent variable:
$$p(y \mid x) = \sum_{z} p(z \mid x) \prod_{t=1}^{m} p(y_t \mid z, x).$$
However, it is computationally intractable to sum over all configurations of latent variables. Following the spirit of latent-based models (Kaiser et al., 2018), we employ a vector quantization technique to maintain differentiability through the categorical modeling and learn the latent variables straightforwardly.
Vector Quantization. Vector quantization based methods have a long history of being used successfully in machine learning models. In vector quantization, each target representation $\mathrm{repr}(y_i) \in \mathbb{R}^{d_{model}}$ is passed through a discretization bottleneck using a nearest-neighbor lookup on the embedding matrix $Q \in \mathbb{R}^{K \times d_{model}}$, where $K$ is the number of categorical codes. For each $y_i$ in the target sequence, we define its categorical variable $z_i$ and latent code $q_i$ as:
$$z_i = k, \quad q_i = Q_k, \quad k = \arg\min_{j \in [K]} \| \mathrm{repr}(y_i) - Q_j \|_2,$$
where $\| \cdot \|_2$ is the $l_2$ distance and $[K]$ denotes the set $\{1, 2, \cdots, K\}$. Intuitively, we adopt the embedding of $y$ as the target representation:
$$\mathrm{repr}(y_i) = \mathrm{embedding}(y_i),$$
where the embedding matrix of the target language is shared with the softmax layer of the decoder.
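The nearest-neighbor lookup described above can be sketched in a few lines. This is an illustrative sketch only: the function name `quantize` is hypothetical, codes are indexed from 0 rather than from 1, and the squared distance is used because it yields the same arg min as the $l_2$ distance.

```python
def quantize(reprs, codebook):
    """Map each target representation to its nearest code.

    reprs:    list of m vectors (the target representations)
    codebook: list of K vectors (the rows of Q)
    Returns (z, q): the code indices and the corresponding code vectors.
    """
    z, q = [], []
    for r in reprs:
        # Squared l2 distance to every code; same arg min as the l2 distance.
        dists = [sum((a - b) ** 2 for a, b in zip(r, code)) for code in codebook]
        k = min(range(len(codebook)), key=dists.__getitem__)
        z.append(k)
        q.append(codebook[k])
    return z, q


codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
reprs = [[0.1, -0.1], [0.9, 1.2], [4.0, 4.5]]
z, q = quantize(reprs, codebook)
```

Here each of the three representations is assigned to a different code, and `q` holds the code vectors that would replace them as latent inputs.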
Exponential Moving Average. Following the common practice of vector quantization, we also employ the exponential moving average (EMA) technique to regularize the categorical codes.
The EMA technique can be understood as k-means clustering of the hidden states with a form of momentum. We maintain an EMA over the following two quantities for each $j \in [K]$: 1) the count $c_j$ measuring the number of target representations that have $Q_j$ as their nearest neighbor, and 2) $Q_j$. The counts are updated over a mini-batch of targets $\{y_1, y_2, \cdots, y_{m \times B}\}$ with:
$$c_j = \lambda c_j + (1 - \lambda) \sum_{i=1}^{m \times B} \mathbb{1}[z_i = j];$$
then, the latent code $Q_j$ is updated with:
$$Q_j = \lambda Q_j + (1 - \lambda) \sum_{i=1}^{m \times B} \frac{\mathbb{1}[z_i = j] \, \mathrm{repr}(y_i)}{c_j},$$
where $\mathbb{1}[\cdot]$ is the indicator function, $\lambda$ is a decay parameter, and $B$ is the size of the batch.
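A minimal sketch of one EMA step may help to see how the counts and codes move together. It assumes the usual form of the update for vector-quantized codebooks (decayed count, then decayed code plus the batch sum divided by the updated count); the function name `ema_update` and the guard for a zero count are illustrative assumptions.

```python
def ema_update(codebook, counts, reprs, z, lam=0.999):
    """One EMA step over a mini-batch.

    codebook: list of K code vectors Q_j
    counts:   list of K running counts c_j
    reprs:    flattened batch of target representations
    z:        code index assigned to each representation
    """
    dim = len(codebook[0])
    new_codebook, new_counts = [], []
    for j, (q_j, c_j) in enumerate(zip(codebook, counts)):
        members = [r for r, z_i in zip(reprs, z) if z_i == j]
        # Decayed count plus the number of representations assigned to code j.
        c_new = lam * c_j + (1.0 - lam) * len(members)
        summed = [sum(r[k] for r in members) for k in range(dim)]
        if c_new > 0.0:
            q_new = [
                lam * q_j[k] + (1.0 - lam) * summed[k] / c_new for k in range(dim)
            ]
        else:
            q_new = list(q_j)  # assumption: leave a never-used code untouched
        new_codebook.append(q_new)
        new_counts.append(c_new)
    return new_codebook, new_counts


new_Q, new_c = ema_update(
    codebook=[[0.0, 0.0], [4.0, 4.0]],
    counts=[2.0, 1.0],
    reprs=[[1.0, 1.0], [1.0, 1.0]],
    z=[0, 0],
    lam=0.5,
)
```

In this toy step the first code moves halfway toward the representations assigned to it, which is the momentum-style k-means behavior described above.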

Modeling Categorical Sequence with Conditional Random Fields
Our next insight is transferring the dependencies among the target outputs into the latent spaces.
Since the categorical variable captures the fuzzy target class information, it can be a proxy of the target outputs. We further employ a structural prediction module instead of the standard autoregressive Transformer to model the latent sequence. The former can explicitly model the dependencies among the latent variables and performs exact decoding during inference.
Conditional Random Fields. We employ a linear-chain conditional random field (CRF; Lafferty et al., 2001), the most common structural prediction model, to model the categorical latent variables. Given the source input $x = (x_1, \cdots, x_n)$ and its corresponding latent variable sequence $z = (z_1, \cdots, z_m)$, the CRF model defines the probability of $z$ as:
$$p(z \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i=1}^{m} s(z_i, x, i) + \sum_{i=2}^{m} t(z_{i-1}, z_i, x, i) \right), \tag{9}$$
where $Z(x)$ is the normalizing factor, $s(z_i, x, i)$ is the emit score of $z_i$ at position $i$, and $t(z_{i-1}, z_i, x, i)$ is the transition score from $z_{i-1}$ to $z_i$.
Before computing the emit score and transition score in Eq. 9, we first take $h = h_{1:m}$ as the inputs and compute the representation $f = \mathrm{Transfer}(h)$, where $\mathrm{Transfer}(\cdot)$ denotes a two-layer vanilla Transformer decoding function including a self-attention block and an encoder-decoder block followed by a feed-forward neural network block (Vaswani et al., 2017).
We then compute the emit score and the transition score. For each position $i$, we compute the emit score with a linear transformation:
$$s(z_i, x, i) = (W^{\top} f_i + b)_{z_i},$$
where $W \in \mathbb{R}^{d_{model} \times K}$ and $b \in \mathbb{R}^{K}$ are the parameters. We incorporate the positional context and compute the transition score with:
$$M_d^i = \mathrm{Biaffine}([f_{i-1}; f_i]), \quad M^i = E_1^{\top} M_d^i E_2, \quad t(z_{i-1}, z_i, x, i) = M^i_{z_{i-1}, z_i},$$
where $\mathrm{Biaffine}(\cdot): \mathbb{R}^{2 d_{model}} \to \mathbb{R}^{d_t \times d_t}$ is a biaffine neural network (Dozat and Manning, 2017), and $E_1, E_2 \in \mathbb{R}^{d_t \times K}$ are the transition matrices.
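As an illustration of how emit and transition scores define a distribution over latent sequences, the sketch below expands a small $d_t \times d_t$ matrix into a $K \times K$ transition table and computes $\log p(z \mid x)$ with the forward algorithm. The names `transition_matrix` and `crf_log_prob` are hypothetical, and the scores are passed in as plain lists rather than produced by a network.

```python
import math


def logsumexp(xs):
    top = max(xs)
    return top + math.log(sum(math.exp(x - top) for x in xs))


def transition_matrix(E1, M_d, E2):
    """Expand a d_t x d_t matrix into a K x K transition table E1^T M_d E2."""
    d_t, K = len(E1), len(E1[0])
    return [
        [
            sum(E1[p][a] * M_d[p][r] * E2[r][b] for p in range(d_t) for r in range(d_t))
            for b in range(K)
        ]
        for a in range(K)
    ]


def crf_log_prob(z, emit, trans):
    """log p(z | x) for a linear-chain CRF with position-dependent transitions.

    emit[i][k]:     emit score of code k at position i
    trans[i][a][b]: transition score from code a at i-1 to code b at i
                    (trans[0] is unused)
    """
    m, K = len(emit), len(emit[0])
    # Unnormalized score of the given sequence.
    score = emit[0][z[0]]
    for i in range(1, m):
        score += trans[i][z[i - 1]][z[i]] + emit[i][z[i]]
    # Forward algorithm for the log normalizer log Z(x).
    alpha = list(emit[0])
    for i in range(1, m):
        alpha = [
            logsumexp([alpha[a] + trans[i][a][b] for a in range(K)]) + emit[i][b]
            for b in range(K)
        ]
    return score - logsumexp(alpha)


emit = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
trans = [None, [[0.2, -0.1], [0.0, 0.3]], [[-0.3, 0.1], [0.4, 0.0]]]
log_p = crf_log_prob([0, 0, 1], emit, trans)
```

Because the transition table is built from two $d_t \times K$ factors, the number of transition parameters grows with $d_t K$ rather than $K^2$, which is one plausible motivation for the factorized form.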

Fusing Source Inputs and Latent Codes via Gated Function
One potential issue is the mismatch between the training and inference stages for the categorical variables. If we train the decoder with the quantized categorical variables $z$ inferred from the target reference, we may fail to achieve satisfactory performance with the predicted categorical variables during inference. We therefore apply a gated neural network (denoted as GateNet) to form the decoder inputs by fusing the copied decoder inputs $h = h_{1:m}$ and the latent codes $q = q_{1:m}$, since the copied decoder inputs $h$ are still informative for non-autoregressive decoding:
$$g_i = \sigma(\mathrm{FFN}([h_i; q_i])), \qquad o_i = h_i \odot g_i + q_i \odot (1 - g_i),$$
where $o_i$ is the fused decoder input at position $i$, $\mathrm{FFN}(\cdot): \mathbb{R}^{2 d_{model}} \to \mathbb{R}^{d_{model}}$ is a two-layer feed-forward neural network, and $\sigma(\cdot)$ is the sigmoid function.
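The gated fusion can be sketched for a single position as follows. This is a sketch under assumptions: the hidden activation of the two-layer network is taken to be ReLU, the output is an element-wise convex combination of the copied input and the latent code, and all names (`gate_fuse`, `W1`, `b1`, `W2`, `b2`) are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(W, b, x):
    """Affine map: W has one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def gate_fuse(h_i, q_i, W1, b1, W2, b2):
    """Fuse a copied decoder input h_i with a latent code q_i via a learned gate."""
    x = h_i + q_i  # list concatenation, i.e. [h_i; q_i]
    hidden = [max(0.0, v) for v in linear(W1, b1, x)]  # assumption: ReLU
    gate = [sigmoid(v) for v in linear(W2, b2, hidden)]
    # Element-wise interpolation between the two sources.
    return [h * g + q * (1.0 - g) for h, q, g in zip(h_i, q_i, gate)]


# With all-zero weights the gate is 0.5 everywhere, so the output is the average.
W1 = [[0.0] * 4 for _ in range(3)]
b1 = [0.0] * 3
W2 = [[0.0] * 3 for _ in range(2)]
b2 = [0.0] * 2
fused = gate_fuse([2.0, 0.0], [0.0, 4.0], W1, b1, W2, b2)
```

A gate close to 1 keeps the copied source information, while a gate close to 0 trusts the latent code, so the network can fall back on the copied inputs when the predicted codes are unreliable.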

Training
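During training, we first compute the reference $z_{ref}$ by vector quantization and employ the EMA to update the quantized codes. The loss of the CRF-based predictor is computed with:
$$\mathcal{L}_{crf} = -\log p(z_{ref} \mid x).$$
To equip the model with the GateNet, we randomly mix $z_{ref}$ and the predicted $z_{pred}$ as:
$$z_{mix} = \begin{cases} z_{pred} & \text{if } p \geq \tau, \\ z_{ref} & \text{if } p < \tau, \end{cases}$$
where $p \sim U[0, 1]$ and $\tau$ is a threshold, set to 0.5 in our experiments. Grounded on $z_{mix}$, the non-autoregressive translation loss is computed with:
$$\mathcal{L}_{NAT} = -\log p(y \mid z_{mix}, x; \theta).$$
With the hyper-parameter $\alpha$, the overall training loss is:
$$\mathcal{L} = \mathcal{L}_{NAT} + \alpha \mathcal{L}_{crf}.$$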
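The mixing step is a form of scheduled sampling and can be sketched as below. Two points are assumptions made for illustration: a single random draw is used per sentence (rather than per token), and the CRF loss is the term weighted by $\alpha$. The names `mix_codes` and `total_loss` are hypothetical.

```python
import random


def mix_codes(z_ref, z_pred, tau=0.5, rng=random):
    """Pick the predicted or the reference latent sequence for this sentence.

    Assumption: one draw p ~ U[0, 1) per sentence; the predicted codes are
    used when p >= tau, otherwise the reference codes are used.
    """
    p = rng.random()
    return list(z_pred) if p >= tau else list(z_ref)


def total_loss(loss_nat, loss_crf, alpha):
    """Overall objective, assuming alpha weights the CRF predictor loss."""
    return loss_nat + alpha * loss_crf


z_mix = mix_codes(z_ref=[3, 1, 4], z_pred=[3, 1, 5], tau=0.5)
loss = total_loss(loss_nat=2.0, loss_crf=4.0, alpha=0.5)
```

Setting `tau` to 1.0 reduces this to training on reference codes only, and `tau` at 0.0 to training on predicted codes only, which makes the role of the threshold easy to probe in an ablation.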

Inference
CNAT selects the best sequence by choosing the highest-probability latent sequence $z^*$ with Viterbi decoding (Viterbi, 1967), then generates the tokens with:
$$z^* = \arg\max_{z} p(z \mid x; \theta) \quad \text{and} \quad y^* = \arg\max_{y} p(y \mid z^*, x; \theta),$$
where identifying $y^*$ only requires independently maximizing the local probability for each output position.
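For reference, Viterbi decoding over emit and position-dependent transition scores can be sketched as follows. The score layout (`emit[i][k]`, `trans[i][a][b]`) and the function name `viterbi` are illustrative assumptions, and the scores are supplied directly instead of coming from a trained model.

```python
def viterbi(emit, trans):
    """Return the highest-scoring code sequence of a linear-chain CRF.

    emit[i][k]:     emit score of code k at position i
    trans[i][a][b]: transition score from code a at i-1 to code b at i
                    (trans[0] is unused)
    """
    m, K = len(emit), len(emit[0])
    score = list(emit[0])  # best score of any prefix ending in each code
    backpointers = []
    for i in range(1, m):
        new_score, pointers = [], []
        for b in range(K):
            best_a = max(range(K), key=lambda a: score[a] + trans[i][a][b])
            new_score.append(score[best_a] + trans[i][best_a][b] + emit[i][b])
            pointers.append(best_a)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final code.
    path = [max(range(K), key=score.__getitem__)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


emit = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
trans = [None, [[0.2, -0.1], [0.0, 0.3]], [[-0.3, 0.1], [0.4, 0.0]]]
best = viterbi(emit, trans)
```

The search costs on the order of $m K^2$ operations, so a small number of codes $K$ keeps this exact decoding step cheap relative to the parallel token generation that follows it.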

Experiments
Datasets. We conduct the experiments on the most widely used machine translation benchmarks: WMT14 English-German (WMT14 EN-DE, 4.5M pairs) and IWSLT14 German-English (IWSLT14, 160K pairs). The datasets are processed with the Moses scripts (Koehn et al., 2007), and the words are segmented into subword units using byte-pair encoding (BPE; Sennrich et al., 2016). We use shared subword embeddings between the source and target languages for the WMT datasets and separate subword embeddings for the IWSLT14 dataset.
Model Setting. In the case of the IWSLT14 task, we use a small setting (
Optimization. We optimize the parameters with Adam (Kingma and Ba, 2015) with $\beta = (0.9, 0.98)$. We use inverse square root learning rate scheduling (Vaswani et al., 2017) for the WMT tasks and a linear annealing schedule (Lee et al., 2018) from $3 \times 10^{-4}$ to $1 \times 10^{-5}$ for the IWSLT14 task. Each mini-batch consists of 2048 tokens for IWSLT14 and 32K tokens for WMT tasks.
Distillation. Sequence-level knowledge distillation (Hinton et al., 2015) is applied to alleviate the multi-modality problem (Gu et al., 2018) while training. We follow previous studies on NAT (Gu et al., 2018;Lee et al., 2018;Wei et al., 2019) and use translations produced by a pre-trained autoregressive Transformer (Vaswani et al., 2017) as the training data.
Reranking. We also include results obtained with reranked parallel decoding (Gu et al., 2018; Guo et al., 2019; Wei et al., 2019), which generates several decoding candidates in parallel and selects the best via re-scoring with a pre-trained autoregressive model. Specifically, we first predict the target length $\hat{m}$ and generate an output sequence with arg max decoding for each length candidate $m \in [\hat{m} - \Delta m, \hat{m} + \Delta m]$ ($\Delta m = 4$ in our experiments, which means there are $N = 9$ candidates), which is called length parallel decoding (LPD). Then, we use the pre-trained teacher to rank these sequences and identify the best overall output as the final output.
Baselines. We compare CNAT against several strong NAT baselines in terms of both generation quality and inference speed-up. For all our tasks, we obtain the performance of the baselines either by directly using the figures reported in previous works, if they are available, or by running the open-source implementations of the baseline algorithms on our datasets.
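The candidate enumeration and reranking steps of length parallel decoding can be sketched as below. The helper names `length_candidates` and `rerank`, the filter that drops non-positive lengths, and the callable `teacher_score` standing in for the pre-trained autoregressive model are all assumptions for illustration.

```python
def length_candidates(m_hat, delta=4):
    """All target lengths within delta of the predicted length m_hat.

    Assumption: non-positive lengths are discarded.
    """
    return [m for m in range(m_hat - delta, m_hat + delta + 1) if m > 0]


def rerank(candidates, teacher_score):
    """Return the candidate translation that the teacher scores highest.

    teacher_score is a hypothetical callable, e.g. the log-probability
    assigned by a pre-trained autoregressive model.
    """
    return max(candidates, key=teacher_score)


lengths = length_candidates(m_hat=20, delta=4)
# Toy stand-in for the teacher: prefer the longest string.
best = rerank(["a b", "a b c d", "a b c"], teacher_score=len)
```

Each candidate length is decoded independently, so the extra cost is mostly one batched decoder pass plus one scoring pass of the teacher.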
Metrics. We evaluate using the tokenized and cased BLEU scores (Papineni et al., 2002). We highlight the best NAT result with bold text.

Results
Translation Quality. First, we compare CNAT with the NAT models without using advanced techniques, such as knowledge distillation, reranking, or iterative refinements. The results are listed in Table 1. CNAT achieves significant improvements (around 11.5 BLEU in EN-DE, more than 14.5 BLEU in DE-EN) over the vanilla NAT, which indicates that modeling categorical information can improve the modeling capability of the NAT model. Also, CNAT achieves better results than Flowseq and SynST, which demonstrates the effectiveness of CNAT in modeling dependencies between the target outputs. The performance of the NAT models with advanced techniques (sequence-level knowledge distillation or reranking) is listed in Table 2 and Table 3. Coupled with the knowledge distillation techniques, all NAT models achieve remarkable improvements.
Our best results are obtained with length parallel decoding, which employs a pretrained Transformer to rerank multiple candidates of different target lengths generated in parallel. Specifically, on the large-scale WMT14 dataset, CNAT surpasses NAT-DCRF by 0.71 BLEU in DE-EN but is slightly below NAT-DCRF, by around 0.20 BLEU, in EN-DE, which shows that CNAT is comparable to the state-of-the-art NAT model. Also, we can see that a larger "N" leads to better results (N = 100 vs. N = 10 for NAT-FT, N = 19 vs. N = 9 for NAT-DCRF, etc.); however, it always comes at the cost of decoding efficiency. We also compare CNAT with NAT models that employ an iterative decoding technique and list the results in Table 4. The iterative-based non-autoregressive Transformer captures the target language's dependencies by iteratively generating based on the previous iteration's output, which is an important exploration for non-autoregressive generation. As the number of iterations increases, the performance improves while the decoding speed-up drops, for both IR-NAT and CMLM. We can see that CNAT achieves a better result than CMLM with four iterations and IR-NAT with ten iterations, and is even close to CMLM with ten iterations, while keeping the benefits of one-shot generation.
Translation Efficiency. We validate the efficiency of CNAT in Figure 2. The decoding speed is measured sentence-by-sentence, and the speed-up is computed by comparing it with the Transformer. Figure 2a and Figure 2b show the BLEU scores and decoding speed-up of NAT models. The former compares the pure NAT models. The latter compares NAT model inference with advanced decoding techniques (parallel reranking or iterative-based decoding).
We can see from Figure 2 that CNAT is located to the top-right of the baselines. CNAT outperforms our baselines in BLEU if speed-up is held fixed, and in speed-up if BLEU is held fixed, indicating that CNAT outperforms previous state-of-the-art NAT methods. Although iterative models like CMLM achieve competitive BLEU scores, they only maintain minor speed advantages over the Transformer. In contrast, CNAT remarkably improves the inference speed while keeping a competitive performance.
Effectiveness of Categorical Modeling. We further conduct experiments on the test set of IWSLT14 to analyze the effectiveness of our categorical modeling and its influence on translation quality. We regard the categorical predictor as a sequence-level generation task and list its BLEU score in Table 5. As can be seen, a better latent prediction yields a better translation. With $z_{ref}$ as the latent sequence, the model achieves surprisingly good performance on this task, showing the usefulness of the learned categorical codes. We can also see that CNAT decoding with the reference length is only slightly (0.44 BLEU) better than decoding with the predicted length, indicating that the model is robust.

Ablation Study
We further conduct an ablation study with different CNAT variants on the dev set of IWSLT14.
Influence of K. We can see that the CRF with the categorical number K = 64 achieves the highest score (line 2). Neither a smaller nor a larger K gives a better result. The AR predictor may have a different tendency: with a larger K = 128, it achieves better performance. However, a larger K may lead to higher latency during inference, which is not ideal for non-autoregressive decoding. In our experiments, K = 64 achieves high performance and is small enough to keep the latency low during inference.
CRF versus AR. Experimental results show that the CRF-based predictor is better than the AR predictor. We can see that the CRF-based predictor surpasses the Transformer predictor by 3.5 BLEU (line 2 vs. line 5) with the GateNet; without the GateNet, the gap enlarges to 5.3 BLEU (line 4 vs. line 6). This is consistent with our intuition that a CRF is better than a Transformer at modeling the dependencies among latent variables in machine translation when the number of categories is small.
GateNet. Without the GateNet, the CNAT with the AR predictor degenerates into a standard LT model with a smaller latent space. We can see that its performance is even lower than the NAT baseline (line 6 vs. line 8). Equipped with the GateNet and scheduled sampling, it outperforms the NAT baseline by a large margin (around 4.0 BLEU), showing that the GateNet mechanism plays an essential role in our proposed model.

Code Study
To analyze the learned categories, we further compute their relation to two kinds of off-the-shelf categorical information: the part-of-speech (POS) tags and the frequency-based clustered classes. For the former, we assign the POS tag of a word to its sub-words and compute the POS tag frequency for the latent codes. For the latter, we roughly assign the category of a subword according to its frequency. Note that the number of frequency-based classes is the same as that of the POS tags.
Quantitative Results. We first compute the V-Measure (Rosenberg and Hirschberg, 2007) score between the latent categories and the POS tags, and between the latent categories and the sub-word frequencies. The results are listed in Table 7. Overall, "w/ POS tags" achieves a higher V-Measure score, indicating that the latent codes are more related to the POS tags than to the sub-word frequencies. The homogeneity score (H-score) evaluates the purity of the category. We can also see that the former has a relatively higher H-score than the latter (0.70 vs. 0.62), which is consistent with our intuition.
Case Analysis. As shown in Figure 3, we also depict the POS tag distribution for the top 10 most frequent latent variables on the test set of IWSLT14 (more details can be found in Appendix B). We can see a sharp distribution for each latent variable, showing that our learned fuzzy classes are meaningful.
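For readers unfamiliar with the metric, the sketch below computes homogeneity, completeness, and V-Measure from two label sequences using the standard entropy-based definitions. Treating the POS tags as the reference classes and the latent codes as the clusters is an assumption about the evaluation setup, and the function name `v_measure` is hypothetical.

```python
import math
from collections import Counter


def v_measure(classes, clusters, beta=1.0):
    """Return (homogeneity, completeness, V-Measure) for two labelings."""
    n = len(classes)
    joint = Counter(zip(classes, clusters))
    n_class, n_cluster = Counter(classes), Counter(clusters)

    def entropy(counts):
        return -sum(v / n * math.log(v / n) for v in counts.values())

    h_class, h_cluster = entropy(n_class), entropy(n_cluster)
    # Conditional entropies H(class | cluster) and H(cluster | class).
    h_class_given_cluster = -sum(
        v / n * math.log(v / n_cluster[k]) for (_, k), v in joint.items()
    )
    h_cluster_given_class = -sum(
        v / n * math.log(v / n_class[c]) for (c, _), v in joint.items()
    )
    homogeneity = 1.0 if h_class == 0 else 1.0 - h_class_given_cluster / h_class
    completeness = 1.0 if h_cluster == 0 else 1.0 - h_cluster_given_class / h_cluster
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = (1 + beta) * homogeneity * completeness / (beta * homogeneity + completeness)
    return homogeneity, completeness, v


# Toy example: latent codes that separate nouns from verbs perfectly.
pos_tags = ["NOUN", "NOUN", "VERB", "VERB"]
codes = [7, 7, 3, 3]
h_score, c_score, v_score = v_measure(pos_tags, codes)
```

A high homogeneity with a lower completeness would indicate that each latent code is dominated by one tag while a single tag is spread over several codes, which is a plausible pattern when there are more codes than tags.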

Related Work
Non-autoregressive Machine Translation. Gu et al. (2018) first develop a non-autoregressive Transformer (NAT) for machine translation, which produces the outputs in parallel and thus significantly boosts the inference speed. Due to the missing dependencies among the target outputs, the translation quality is largely sacrificed. A line of work proposes to mitigate such performance degradation by enhancing the decoder inputs. Lee et al. (2018) propose a method of iterative refinement based on the previous outputs. Guo et al. (2019) enhance the decoder inputs by introducing the phrase table from statistical machine translation and embedding transformation. Some work also focuses on improving the supervision of the decoder inputs, including imitation learning from autoregressive models (Wei et al., 2019) or regularizing the hidden state with backward reconstruction error.
Another line of work proposes modeling the dependencies among target outputs, which are explicitly missing in the vanilla NAT models. Qian et al. (2020) and Ghazvininejad et al. (2019) propose to model the target-side dependencies with a masked language model, modeling the directed dependencies between the observed target words and the unobserved words. Different from their work, we model the target-side dependencies in the latent space, following the latent variable Transformer fashion.
Latent Variable Transformer. Closer to our work is the latent variable Transformer, which takes latent variables as inputs to model the target-side information. Shu et al. (2019) combine continuous latent variables and a deterministic inference procedure to find the target sequence that maximizes the lower bound to the log-probability. Ma et al. (2019) propose to use generative flows to model complex prior distributions. Kaiser et al. (2018) propose to autoregressively decode a shorter latent sequence encoded from the target sentence, then simultaneously generate the sentence from the latent sequence. Bao et al. (2019) model the target positions of the decoder inputs as latent variables and introduce a heuristic search algorithm to guide the position learning. Akoury et al. (2019) first autoregressively predict a chunked parse tree and then simultaneously generate the target tokens from the predicted syntax.

Conclusion
We propose CNAT, which implicitly models the categorical codes of the target language, narrowing the performance gap between non-autoregressive decoding and autoregressive decoding. Specifically, CNAT builds upon the latent Transformer and models the target-side categorical information with vector quantization and a conditional random field (CRF) model. We further employ a gated neural network to form the decoder inputs. Equipped with scheduled sampling, CNAT works more robustly. As a result, CNAT achieves a significant improvement and moves closer to the performance of the Transformer on machine translation.
Results. We can see that in Table 8