Semi-Autoregressive Neural Machine Translation

Existing approaches to neural machine translation are typically autoregressive models. While these models attain state-of-the-art translation quality, they suffer from low parallelizability and are thus slow at decoding long sequences. In this paper, we propose a novel model for fast sequence generation: the semi-autoregressive Transformer (SAT). The SAT keeps the autoregressive property globally but relaxes it locally, and is thus able to produce multiple successive words in parallel at each time step. Experiments conducted on English-German and Chinese-English translation tasks show that the SAT achieves a good balance between translation quality and decoding speed. On WMT'14 English-German translation, the SAT achieves a 5.58× speedup while maintaining 88% of the translation quality, significantly better than previous non-autoregressive methods. When producing two words at each time step, the SAT is almost lossless (only a 1% degradation in BLEU score).


Introduction
Neural networks have been successfully applied to a variety of tasks, including machine translation. The encoder-decoder architecture is the central idea of neural machine translation (NMT). The encoder first encodes a source-side sentence $x = x_1 \dots x_m$ into hidden states, and then the decoder generates the target-side sentence $y = y_1 \dots y_n$ from the hidden states according to an autoregressive model

$$p(y_t \mid y_1 \dots y_{t-1}, x).$$

Recurrent neural networks (RNNs) are inherently good at processing sequential data. Sutskever et al. successfully applied RNNs to machine translation. The attention mechanism was later introduced into the encoder-decoder architecture and greatly improved NMT. GNMT (Wu et al., 2016) further improved NMT with a number of techniques, including residual connections and reinforcement learning.
* Part of this work was done when the author was at the Institute of Automation, Chinese Academy of Sciences.
The sequential nature of RNNs has led to their wide application in language processing. However, the same property hinders parallelization, so RNNs are slow to execute on modern hardware optimized for parallel execution. As a result, a number of more parallelizable sequence models have been proposed, such as ConvS2S (Gehring et al., 2017) and the Transformer (Vaswani et al., 2017). These models avoid dependencies between different positions within each layer and can thus be trained much faster than RNN-based models. At inference time, however, these models are still slow because of the autoregressive property.
A recent work (Gu et al., 2017) proposed a non-autoregressive NMT model that generates all target-side words in parallel. While parallelizability is greatly improved, translation quality drops considerably. In this paper, we propose the semi-autoregressive Transformer (SAT) for faster sequence generation. Unlike the model of Gu et al. (2017), the SAT is semi-autoregressive, which means it keeps the autoregressive property globally but relaxes it locally. As a result, the SAT can produce multiple successive words in parallel at each time step. Figure 1 gives an illustration of the different levels of autoregressive properties.
Experiments conducted on English-German and Chinese-English translation show that compared with non-autoregressive methods, the SAT achieves a better balance between translation quality and decoding speed. On WMT'14 English-German translation, the proposed SAT is 5.58× faster than the Transformer while maintaining 88% of translation quality. Besides, when producing two words at each time step, the SAT is almost lossless.
It is worth noting that although we apply the SAT to machine translation, it is not designed specifically for translation, unlike the models of Gu et al. (2017) and Lee et al. (2018). The SAT can also be applied to any other sequence generation task, such as summary generation and image caption generation.

Related Work
Almost all state-of-the-art NMT models are autoregressive (Wu et al., 2016; Gehring et al., 2017; Vaswani et al., 2017), meaning that the model generates words one by one and is not friendly to modern hardware optimized for parallel execution. A recent work (Gu et al., 2017) attempts to accelerate generation by introducing a non-autoregressive model. Starting from the Transformer (Vaswani et al., 2017), they made a number of modifications. The most significant one is that, to predict the next target word, they feed the decoder with the source words instead of the previously generated target words. They also introduced a set of latent variables that model the fertilities of source words, in order to tackle the multimodality problem in translation. Lee et al. (2018) proposed another non-autoregressive sequence model based on iterative refinement. The model can be viewed as both a latent variable model and a conditional denoising autoencoder. They also proposed a learning algorithm that is a hybrid of lower-bound maximization and reconstruction error minimization.
The work most relevant to our proposed semi-autoregressive model is Kaiser et al. (2018). They first autoencode the target sequence into a shorter sequence of discrete latent variables, which at inference time is generated autoregressively, and finally decode the output sequence from this shorter latent sequence in parallel. What we have in common with their idea is that we do not entirely abandon autoregression, but rather shorten the autoregressive path.
A related study on realistic speech synthesis is the parallel WaveNet (Oord et al., 2017). The paper introduced probability density distillation, a new method for training a parallel feed-forward network from a trained WaveNet (Van Den Oord et al., 2016) with no significant difference in quality.
There is also some work that shares a somewhat similar idea with ours: character-level NMT (Chung et al., 2016; Lee et al., 2016) and chunk-based NMT (Ishiwatari et al., 2017). Unlike the SAT, these models are not able to produce multiple tokens (characters or words) at each time step. Oda et al. (2017) proposed a bit-level decoder, where a word is represented by a binary code and each bit of the code can be predicted in parallel.

The Transformer
Since our proposed model is built upon the Transformer (Vaswani et al., 2017), we will briefly introduce the Transformer. The Transformer uses an encoder-decoder architecture. We describe the encoder and decoder below.

The Encoder
From the source tokens, learned embeddings of dimension $d_{model}$ are generated, which are then modified by an additive positional encoding. The positional encoding is necessary since the network does not leverage the order of the sequence by recurrence or convolution. The authors use an additive encoding defined as:

$$PE(pos, 2i) = \sin\left(pos / 10000^{2i/d_{model}}\right)$$
$$PE(pos, 2i+1) = \cos\left(pos / 10000^{2i/d_{model}}\right)$$

where pos is the position of a word in the sentence and i is the dimension. The authors chose this function because they hypothesized it would allow the model to learn to attend by relative positions easily. The encoded word embeddings are then used as input to the encoder, which consists of N blocks, each containing two layers: (1) a multi-head attention layer, and (2) a position-wise feed-forward layer.
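As an illustration of the sinusoidal encoding defined above, the following minimal sketch builds the encoding table in plain Python. The function name `positional_encoding` and the `max_len` argument are illustrative choices, not names taken from the paper or its released code.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(max_len):
        row = [0.0] * d_model
        for k in range(0, d_model, 2):
            # k plays the role of 2i: dimensions k and k + 1 share one frequency.
            angle = pos / (10000 ** (k / d_model))
            row[k] = math.sin(angle)          # PE(pos, 2i)
            if k + 1 < d_model:
                row[k + 1] = math.cos(angle)  # PE(pos, 2i + 1)
        table.append(row)
    return table

pe = positional_encoding(max_len=4, d_model=8)
print([round(v, 4) for v in pe[1]])
```

Position 0 always maps to alternating zeros and ones, since sin(0) = 0 and cos(0) = 1; the table is added element-wise to the token embeddings.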
Multi-head attention builds upon scaled dot-product attention, which operates on a query Q, key K and value V:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key. The authors scale the dot product by $1/\sqrt{d_k}$ to prevent the inputs to the softmax function from growing too large in magnitude. Multi-head attention computes h different queries, keys and values with h linear projections, computes scaled dot-product attention for each query, key and value, concatenates the results, and projects the concatenation with another linear projection:

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O$$

The attention mechanism in the encoder performs attention over itself (Q = K = V), so it is also called self-attention.
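To make the scaling by $1/\sqrt{d_k}$ concrete, here is a small single-head sketch in plain Python. It assumes the standard scaled dot-product form softmax(QK^T / sqrt(d_k))V from Vaswani et al. (2017); the helper names are illustrative, and multi-head projection and masking are deliberately left out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v, all as lists of lists."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Dot product of the query with every key, scaled by 1 / sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

# A zero query attends uniformly, so the result is the mean of the values.
print(scaled_dot_product_attention([[0.0, 0.0]],
                                   [[1.0, 0.0], [0.0, 1.0]],
                                   [[1.0, 2.0], [3.0, 4.0]]))
```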
The second component in each encoder block is a position-wise feed-forward layer defined as:

$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

For more stable and faster convergence, a residual connection (He et al., 2016) is applied to each layer, followed by layer normalization (Ba et al., 2016). For regularization, dropout (Srivastava et al., 2014) is applied before the residual connections.
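The feed-forward sublayer together with its residual connection and layer normalization can be sketched as follows. This is a simplified illustration under stated assumptions: the feed-forward layer is taken to be the ReLU form max(0, xW1 + b1)W2 + b2 of Vaswani et al. (2017), layer normalization is shown without its learned gain and bias, dropout is omitted, and all function names are hypothetical.

```python
import math

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward layer: max(0, x W1 + b1) W2 + b2."""
    hidden = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    return [sum(hj * W2[j][k] for j, hj in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def ffn_sublayer(x, W1, b1, W2, b2):
    """Residual connection around the feed-forward layer, then layer norm."""
    y = feed_forward(x, W1, b1, W2, b2)
    return layer_norm([xi + yi for xi, yi in zip(x, y)])

x = [1.0, -2.0]
W1 = [[1, 0, 1], [0, 1, 1]]    # 2 x 3
W2 = [[1, 0], [0, 1], [1, 1]]  # 3 x 2
print(ffn_sublayer(x, W1, [0, 0, 0], W2, [0, 0]))
```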

The Decoder
The decoder is similar to the encoder and is also composed of N blocks. In addition to the two layers in each encoder block, the decoder inserts a third layer, which performs multi-head attention over the output of the encoder. It is worth noting that, unlike in the encoder, the self-attention layer in the decoder must be masked with a causal mask, a lower triangular matrix, to ensure that during training the prediction for position i can depend only on the known outputs at positions less than i.

Group-Level Chain Rule
Standard NMT models usually factorize the joint probability of a word sequence $y_1 \dots y_n$ according to the word-level chain rule

$$p(y_1 \dots y_n \mid x) = \prod_{t=1}^{n} p(y_t \mid y_1 \dots y_{t-1}, x),$$

so that decoding each word depends on all previous decoding results, which hinders parallelization. In the SAT, we extend the standard word-level chain rule to the group-level chain rule.
We first divide the word sequence $y_1 \dots y_n$ into consecutive groups

$$G_1, G_2, \dots, G_{[(n-1)/K]+1} = y_1 \dots y_K,\; y_{K+1} \dots y_{2K},\; \dots,\; y_{[(n-1)/K] \times K + 1} \dots y_n,$$

where $[\cdot]$ denotes the floor operation and K is the group size, which also indicates the degree of parallelizability: the larger K is, the higher the parallelizability. Except for the last group, all groups must contain K words. Then comes the group-level chain rule

$$p(y_1 \dots y_n \mid x) = \prod_{t=1}^{[(n-1)/K]+1} p(G_t \mid G_1 \dots G_{t-1}, x).$$

This group-level chain rule removes the dependencies between consecutive words that are in the same group. With the group-level chain rule, the model no longer produces words one by one as the Transformer does, but rather group by group. In the next subsections, we show how to implement the model in detail.
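The grouping step can be sketched in a few lines of Python. The sketch assumes that the number of groups is floor((n - 1) / K) + 1, which follows from requiring every group except the last to contain exactly K words; the function names are illustrative.

```python
def split_into_groups(tokens, K):
    """Split a token list into consecutive groups of size K (last one may be shorter)."""
    if K < 1:
        raise ValueError("group size K must be at least 1")
    return [tokens[i:i + K] for i in range(0, len(tokens), K)]

def num_groups(n, K):
    """Number of groups for a sequence of n >= 1 words: floor((n - 1) / K) + 1."""
    return (n - 1) // K + 1

words = ["y1", "y2", "y3", "y4", "y5"]
for K in (1, 2, 4):
    groups = split_into_groups(words, K)
    print(K, num_groups(len(words), K), groups)
```

With K = 1 every word forms its own group, which corresponds to the ordinary word-level factorization.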

Long-Distance Prediction
In autoregressive models, to predict $y_t$, the model is fed with the previous word $y_{t-1}$. We refer to this as short-distance prediction. In the SAT, however, we feed $y_{t-K}$ to predict $y_t$, which we refer to as long-distance prediction. At the beginning of decoding, we feed the model with K special symbols <s> to predict $y_1 \dots y_K$ in parallel. Then $y_1 \dots y_K$ are fed to the model to predict $y_{K+1} \dots y_{2K}$ in parallel. This process continues until a terminator </s> is generated. Figure 3 illustrates both short-distance and long-distance prediction.
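The decoding procedure described above can be sketched as a simple greedy loop. In this sketch `predict_group` is a hypothetical stand-in for one parallel decoder call, and `make_toy_model` is a made-up "model" that simply replays a fixed target sentence; neither is taken from the released implementation.

```python
BOS, EOS = "<s>", "</s>"

def make_toy_model(target):
    """Hypothetical stand-in for a trained SAT: replays a fixed target sentence."""
    padded = list(target) + [EOS]
    def predict_group(decoder_inputs, K):
        # The last K inputs (y_{t-K} for each new position t) yield the next K outputs.
        start = len(decoder_inputs) - K
        return [padded[i] if i < len(padded) else EOS
                for i in range(start, start + K)]
    return predict_group

def sat_greedy_decode(predict_group, K, max_len=50):
    """Emit K tokens per step until </s> appears; max_len is checked once per step."""
    decoder_inputs = [BOS] * K   # K special symbols start the decoding
    output, steps = [], 0
    while len(output) < max_len:
        group = predict_group(decoder_inputs, K)
        steps += 1
        for token in group:
            if token == EOS:
                return output, steps
            output.append(token)
        decoder_inputs.extend(group)  # y_1..y_K are fed back to predict y_{K+1}..y_{2K}
    return output, steps

model = make_toy_model("a b c d e".split())
for K in (1, 2, 4, 6):
    print(K, sat_greedy_decode(model, K))
```

For this five-word toy target, the number of decoder calls falls from 6 with K = 1 to 3 with K = 2, 2 with K = 4 and 1 with K = 6, which is the source of the speedup.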

Relaxed Causal Mask
In the Transformer decoder, the causal mask is a lower triangular matrix, which strictly prevents earlier decoding steps from peeping at information from later steps. We refer to it as the strict causal mask. In the SAT decoder, however, the strict causal mask is not a good choice. As described in the previous subsection, in long-distance prediction the model predicts $y_{K+1}$ when fed with $y_1$. With the strict causal mask, the model can only access $y_1$ when predicting $y_{K+1}$, which is unreasonable since $y_1 \dots y_K$ have already been produced. It is better to allow the model to access $y_1 \dots y_K$, rather than only $y_1$, when predicting $y_{K+1}$. Therefore, we use a coarse-grained lower triangular matrix as the causal mask, which allows peeping at later information in the same group. We refer to it as the relaxed causal mask. Given the target length n and the group size K, the relaxed causal mask $M \in \mathbb{R}^{n \times n}$ is defined element-wise, with i and j indexed from 1, as:

$$M[i][j] = \begin{cases} 1 & \text{if } j \le ([(i-1)/K] + 1) \times K \\ 0 & \text{otherwise} \end{cases}$$

For a more intuitive understanding, Figure 4 gives a comparison between the strict and relaxed causal masks.
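A relaxed causal mask of this kind can be generated with a one-line rule. The sketch below uses 0-indexed rows and columns and assumes that position i may attend to every position up to the end of its own group, which is the behavior described above; the function name is illustrative.

```python
def relaxed_causal_mask(n, K):
    """n x n mask; entry [i][j] is 1 if position i may attend to position j (0-indexed)."""
    # Row i belongs to group i // K, whose last column index is (i // K + 1) * K - 1.
    return [[1 if j < (i // K + 1) * K else 0 for j in range(n)]
            for i in range(n)]

for row in relaxed_causal_mask(6, 2):
    print(" ".join(str(v) for v in row))
```

For n = 6 and K = 2 this prints three pairs of identical rows (1 1 0 0 0 0, then 1 1 1 1 0 0, then 1 1 1 1 1 1), the coarse-grained lower triangular pattern.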

The SAT
Using the group-level chain rule instead of the word-level chain rule, long-distance prediction instead of short-distance prediction, and the relaxed causal mask instead of the strict causal mask, we successfully extended the Transformer to the SAT. The Transformer can be viewed as a special case of the SAT, when the group size K = 1. The non-autoregressive Transformer (NAT) described in Gu et al. (2017) can also be viewed as a special case of the SAT, when the group size K is not less than the maximum target length. Table 1 gives the theoretical complexity and acceleration of the model. We list two search strategies separately: beam search and greedy search. Beam search is the most prevalent search strategy. However, it requires the decoder states to be updated once every word is generated, which hinders decoding parallelizability. When decoding with greedy search, there is no such concern, so the parallelizability of the SAT can be maximized.

Strict causal mask      Relaxed causal mask
1 0 0 0 0 0             1 1 0 0 0 0
1 1 0 0 0 0             1 1 0 0 0 0
1 1 1 0 0 0             1 1 1 1 0 0
1 1 1 1 0 0             1 1 1 1 0 0
1 1 1 1 1 0             1 1 1 1 1 1
1 1 1 1 1 1             1 1 1 1 1 1
Figure 4: Strict causal mask (left) and relaxed causal mask (right) when the target length n = 6 and the group size K = 2. The two masks differ at entries (1, 2), (3, 4) and (5, 6).

[Table 1: Model, Complexity, Acceleration.]
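The two special cases can be checked directly on the attention mask. The following sketch assumes the relaxed mask lets each position attend up to the end of its own group (0-indexed), and it only illustrates the masking side of the argument; function names are hypothetical.

```python
def relaxed_causal_mask(n, K):
    """Entry [i][j] is 1 if position i may attend to position j (0-indexed)."""
    return [[1 if j < (i // K + 1) * K else 0 for j in range(n)]
            for i in range(n)]

def strict_causal_mask(n):
    """Ordinary lower triangular mask of the Transformer decoder."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

n = 6
# K = 1: the relaxed mask reduces to the strict lower triangular mask.
same_as_transformer = relaxed_causal_mask(n, 1) == strict_causal_mask(n)
# K >= n: a single group covers the sentence, so the mask restricts nothing.
unrestricted = relaxed_causal_mask(n, n) == [[1] * n for _ in range(n)]
print(same_as_transformer, unrestricted)
```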

Experiments
We evaluate the proposed SAT on English-German and Chinese-English translation tasks.

Experimental Settings
Datasets For English-German translation, we choose the corpora provided by WMT 2014 (Bojar et al., 2014). We use the newstest2013 dataset for development and the newstest2014 dataset for testing. For Chinese-English translation, the corpora we use are extracted from LDC. We chose the NIST02 dataset for development and the NIST03, NIST04 and NIST05 datasets for testing. For English and German, we tokenized the text and segmented it into subword symbols using byte-pair encoding (BPE) (Sennrich et al., 2015) to restrict the vocabulary size. As for Chinese, we segmented sentences into characters. For English-German translation, we use a shared source and target vocabulary. Table 2 summarizes the two corpora.
Baseline We use the base Transformer model described in Vaswani et al. (2017) as the baseline, where $d_{model} = 512$ and N = 6. In addition, for comparison, we also prepared a lighter Transformer model, in which two encoder/decoder blocks are used (N = 2) and the other hyperparameters remain the same.
Hyperparameters Unless otherwise specified, all hyperparameters are inherited from the base Transformer model. We try three different settings of the group size K: K = 2, K = 4, and K = 6. For English-German translation, we share the same weight matrix between the source and target embedding layers and the pre-softmax linear layer. For Chinese-English translation, we only share weights of the target embedding layer and the pre-softmax linear layer.

Search Strategies
We use two search strategies: beam search and greedy search. As mentioned in Section 4.4, these two strategies lead to different degrees of parallelizability. When the beam size is set to 1, greedy search is used; otherwise, beam search is used.
Knowledge Distillation Knowledge distillation (Hinton et al., 2015;Kim and Rush, 2016) describes a class of methods for training a smaller student network to perform better by learning from a larger teacher network. For NMT, Kim and Rush (2016) proposed a sequence-level knowledge distillation method. In this work, we apply this method to train the SAT using a pre-trained

Initialization
Since the SAT and the Transformer have only slight differences in their architecture (see Figure 2), in order to accelerate convergence, we use a pre-trained Transformer model to initialize some parameters in the SAT. These parameters include all parameters in the encoder, source and target word embeddings, and pre-softmax weights. Other parameters are initialized randomly. In addition to accelerating convergence, we find this method also slightly improves the translation quality.
Training As in Vaswani et al. (2017), we train the SAT by minimizing cross-entropy with label smoothing. The optimizer we use is Adam (Kingma and Ba, 2015) with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$. We vary the learning rate during training using the learning rate function described in Vaswani et al. (2017). All models are trained for 10K steps on 8 NVIDIA TITAN Xp GPUs, with each minibatch consisting of about 30k tokens. For evaluation, we average the last five checkpoints, saved at an interval of 1000 training steps.
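For reference, the learning rate function of Vaswani et al. (2017) increases the rate linearly during a warm-up phase and then decays it with the inverse square root of the step number. The sketch below assumes that schedule; the warm-up length of 4000 steps is the default from Vaswani et al. (2017) and is an assumption here, since this paper does not state the value it used.

```python
def transformer_learning_rate(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5).

    warmup_steps=4000 is an assumed value, not one reported in this paper.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 10000):
    print(step, transformer_learning_rate(step))
```

The rate peaks at `step == warmup_steps`, where the two arguments of `min` coincide.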

Evaluation Metrics
We evaluate the translation quality of the model using BLEU score (Papineni et al., 2002).
Implementation We implement the proposed SAT with TensorFlow (Abadi et al., 2016). The code and resources needed for reproducing the results are released at https://github.com/chqiwang/sa-nmt.

Results on English-German
Table 3 summarizes the results of English-German translation. According to the results, the translation quality of the SAT gradually decreases as K increases, which is consistent with intuition. When K = 2, the SAT decodes 1.51× faster than the Transformer and is almost lossless in translation quality (a drop of only 0.21 BLEU). With K = 6, the SAT achieves a 2.98× speedup while the performance degradation is only 8%.
When using greedy search, the acceleration becomes much more significant. When K = 6, the decoding speed of the SAT reaches about 5.58× that of the Transformer while maintaining 88% of the translation quality. Compared with Gu et al. (2017), Kaiser et al. (2018) and Lee et al. (2018), the SAT achieves a better balance between translation quality and decoding speed. Compared to the lighter Transformer (N = 2), the SAT with K = 4 achieves a higher speedup with significantly better translation quality.

Model        b=1    b=16   b=32   b=64
Transformer  346ms  58ms   53ms   56ms
SAT, K=2     229ms  38ms   32ms   32ms
SAT, K=4     149ms  24ms   21ms   20ms
SAT, K=6     116ms  20ms   17ms   16ms
Table 4: Decoding latency under different batch size settings, where b denotes the batch size.
In a real production environment, sentences are often decoded not one by one but batch by batch. To investigate whether the SAT can still accelerate decoding when decoding in batches, we test the decoding latency under different batch size settings. As shown in Table 4, the SAT significantly accelerates decoding even with a large batch size.
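The speedups implied by the latency figures reported above for b = 1, 16, 32 and 64 can be computed directly as ratios to the Transformer latency. The short sketch below does only that arithmetic; the dictionary layout and function name are illustrative.

```python
# Latency in milliseconds, keyed by model and then by batch size b.
latency_ms = {
    "Transformer": {1: 346, 16: 58, 32: 53, 64: 56},
    "SAT, K=2": {1: 229, 16: 38, 32: 32, 64: 32},
    "SAT, K=4": {1: 149, 16: 24, 32: 21, 64: 20},
    "SAT, K=6": {1: 116, 16: 20, 32: 17, 64: 16},
}

def speedups(latency, baseline="Transformer"):
    """Speedup of each model over the baseline, per batch size."""
    base = latency[baseline]
    return {model: {b: base[b] / t for b, t in row.items()}
            for model, row in latency.items() if model != baseline}

for model, row in speedups(latency_ms).items():
    print(model, {b: round(s, 2) for b, s in row.items()})
```

At batch size 1 the ratios for K = 2 and K = 6 come out to about 1.51 and 2.98, matching the speedups quoted in the text, and the ratio for K = 6 is still 3.5 at b = 64.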
It is also useful to know whether the SAT can still accelerate decoding on a CPU, which does not support parallel execution as well as a GPU does. The results in Table 5 show that even on a CPU, the SAT still accelerates decoding significantly.

Results on Chinese-English
Table 6 summarizes the results on Chinese-English translation. With K = 2, the SAT decodes 1.69× faster while maintaining 97% of the translation quality. In an extreme setting where K = 6 and the beam size is 1, the SAT achieves a 6.41× speedup while maintaining 83% of the translation quality.

Analysis
Effects of Knowledge Distillation As shown in Figure 5, sequence-level knowledge distillation is very effective for training the SAT. The effect is more significant for larger K. This phenomenon echoes the observations of Gu et al. (2017), Oord et al. (2017) and Lee et al. (2018). In addition, we tried word-level knowledge distillation (Kim and Rush, 2016), but observed only a slight improvement.
Position-Wise Cross-Entropy In Figure 6, we plot the position-wise cross-entropy for various models. To allow comparison with the baseline model, the results in the figure are from models trained on the original corpora, i.e., without knowledge distillation. As shown in the figure, the position-wise cross-entropy has an apparent periodicity with a period of K. For positions in the same group, the position-wise cross-entropy increases monotonically, which indicates that long-distance dependencies are always more difficult to model than short ones. This suggests that the key to further improving the SAT is to improve its ability to model long-distance dependencies.
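A position-wise cross-entropy curve of this kind can be obtained by averaging per-token negative log-likelihoods over sentences, and the periodicity can be summarized by averaging again over the offset within a group. The sketch below is only an illustration of that bookkeeping; the function names are hypothetical and the input numbers are made up, not results from the paper.

```python
from collections import defaultdict

def position_wise_cross_entropy(token_nll):
    """token_nll: one list of per-token negative log-likelihoods per sentence.

    Returns the mean value at each target position (0-indexed).
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for sentence in token_nll:
        for pos, nll in enumerate(sentence):
            sums[pos] += nll
            counts[pos] += 1
    return [sums[p] / counts[p] for p in sorted(sums)]

def mean_by_group_offset(position_ce, K):
    """Average the curve over positions that share the same offset pos % K."""
    buckets = [[] for _ in range(K)]
    for pos, ce in enumerate(position_ce):
        buckets[pos % K].append(ce)
    return [sum(b) / len(b) for b in buckets if b]

# Made-up per-token values for two short sentences.
curve = position_wise_cross_entropy([[1.0, 2.0, 1.0, 2.0], [3.0, 4.0]])
print(curve)
print(mean_by_group_offset(curve, K=2))
```

A curve whose per-offset means rise from the first to the last offset would reflect the within-group increase described above.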
Case Study Table 7 lists three sample Chinese-English translations by the SAT and the Transformer. [...] generate fluent sentences. As reported by Gu et al. (2017), instances of repeated words or phrases are most prevalent in their non-autoregressive model. In the SAT, this is also the case. This suggests that we may be able to improve the translation quality of the SAT by reducing the similarity of the output distributions of adjacent positions.

Source       [source text not recovered]
Transformer  the international football federation will severely punish the fraud on the football field
SAT, K=2     fifa will severely punish the deception on the football field
SAT, K=4     fifa a will severely punish the fraud on the football court
SAT, K=6     fifa a will severely punish the fraud on the football football court
Reference    federation international football association to mete out severe punishment for fraud on the football field

Source       [source text not recovered]
Transformer  the largescale exhibition of campus culture will also be held during the meeting .
SAT, K=2     the largescale cultural cultural exhibition on campus will also be held during the meeting .
SAT, K=4     the campus campus exhibition will also be held during the meeting .
SAT, K=6     a largescale campus culture exhibition will also be held on the sidelines of the meeting .
Reference    there will also be a large -scale campus culture show during the conference .

Source       [source text not recovered]
Transformer  this is the second time mr koizumi has visited the yasukuni shrine since he came to power .
SAT, K=2     this is the second time that mr koizumi has visited the yasukuni shrine since he took office .
SAT, K=4     this is the second time that koizumi has visited the yasukuni shrine since he came into power .
SAT, K=6     this is the second visit to the yasukuni shrine since mr koizumi came office power .
Reference    this is the second time that junichiro koizumi has paid a visit to the yasukuni shrine since he became prime minister .
Table 7: Three sample Chinese-English translations by the SAT and the Transformer. Repeated words or phrases are marked in red and underlined in the original table.

Conclusion
In this work, we have introduced a novel model for faster sequence generation based on the Transformer (Vaswani et al., 2017), which we refer to as the semi-autoregressive Transformer (SAT). By combining the original Transformer with the group-level chain rule, long-distance prediction and the relaxed causal mask, the SAT can produce multiple consecutive words at each time step, thus speeding up decoding significantly. We conducted experiments on English-German and Chinese-English translation. Compared with previously proposed non-autoregressive models (Gu et al., 2017; Lee et al., 2018; Kaiser et al., 2018), the SAT achieves a better balance between translation quality and decoding speed. On WMT'14 English-German translation, the SAT achieves a 5.58× speedup while maintaining 88% of the translation quality, significantly better than previous methods. When producing two words at each time step, the SAT is almost lossless (only a 1% degradation in BLEU score).
In the future, we plan to investigate better methods for training the SAT to further shrink the performance gap between the SAT and the Transformer. Specifically, we believe that the following two directions are worth studying. First, using objective functions beyond maximum likelihood to improve the modeling of long-distance dependencies. Second, exploring new methods for knowledge distillation. We also plan to extend the SAT to allow the use of different group sizes K at different positions, instead of using a fixed value.