Towards Compact and Fast Neural Machine Translation Using a Combined Method

Neural Machine Translation (NMT) places a heavy burden on computation and memory, which makes it challenging to deploy NMT models on devices with limited computation and memory budgets. This paper presents a four-stage pipeline to compress the model and speed up decoding for NMT. Our method first introduces a compact architecture based on a convolutional encoder and weight-shared embeddings. Weight pruning is then applied to obtain a sparse model. Next, we propose a fast sequence interpolation approach that enables greedy decoding to achieve performance on par with beam search, so the time-consuming beam search can be replaced by simple greedy decoding. Finally, vocabulary selection is used to reduce the computation of the softmax layer. Our final model achieves a 10× speedup, a 17× reduction in parameters, a storage size under 35MB, and performance comparable to the baseline model.


Introduction
Neural Machine Translation (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015) has recently gained popularity as an approach to machine translation. Although NMT has achieved state-of-the-art performance for several language pairs (Wu et al., 2016), like many other deep learning models it is both computationally and memory intensive. This makes it challenging to deploy NMT models on devices with limited computation and memory budgets.
Weight pruning and knowledge distillation have been shown to compress NMT models (See et al., 2016; Kim and Rush, 2016). These methods reduce the parameters from a global perspective. However, embeddings dominate the parameters of a relatively compact NMT model even when subword units (Sennrich et al., 2016) (typically about 30K) are used. Character-aware methods (Ling et al., 2015; Lee et al., 2016) use fewer embeddings but suffer from slower decoding (Wu et al., 2016). Recent work has shown that weight sharing can be adopted to compress embeddings in language models. We are interested in applying embedding weight sharing to NMT.
As for decoding speedup, Gehring et al. (2016) and Kalchbrenner et al. (2016) improved the parallelism of NMT by substituting CNNs for RNNs. Kim and Rush (2016) proposed sequence-level knowledge distillation, which allows beam search to be replaced with greedy decoding. Gu et al. (2017) exploited trainable greedy decoding via the actor-critic algorithm (Konda and Tsitsiklis, 2002). Wu et al. (2016) evaluated quantized inference for NMT. Vocabulary selection (Mi et al., 2016; L'Hostis et al., 2016) is commonly used to speed up the softmax layer. Search pruning has also been applied to speed up beam search (Hu et al., 2015; Wu et al., 2016). Compared to search pruning, the speedup from greedy decoding is more attractive. Knowledge distillation improves the performance of greedy decoding, but it requires running beam search over the training set, which is inefficient for corpora with tens of millions of sentences. Trainable greedy decoding uses a relatively sophisticated training procedure. We prefer a simple and fast approach that allows us to replace beam search with greedy search. In this work, a novel approach is proposed to improve the performance of greedy decoding directly, and embedding weight sharing is introduced into NMT. We investigate model compression and decoding speedup for NMT from the perspectives of network architecture, sparsification, computation, and search strategy, and test the performance of their combination. Specifically, we present a four-stage pipeline for model compression and decoding speedup. First, we train a compact NMT model based on a convolutional encoder and weight sharing. The convolutional encoder works well at smaller model sizes and is robust to pruning. Weight sharing further reduces the number of embeddings by several fold. Then weight pruning is applied to obtain a sparse model. Next, we propose fast sequence interpolation to improve the performance of greedy decoding directly.
This approach uses batched greedy decoding to obtain samples and is therefore more efficient than Kim and Rush (2016). Finally, we use vocabulary selection to reduce the computation of the softmax layer. Our final model achieves a 10× speedup, a 17× parameter reduction, and a storage size under 35MB, with performance comparable to the baseline model.
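To make the greedy-versus-beam trade-off concrete, here is a minimal, self-contained Python sketch contrasting greedy decoding (one hypothesis per step) with beam search (beam_size hypotheses per step). The `step` function and `VOCAB` are hypothetical toy stand-ins for a real NMT decoder, not the paper's model:

```python
import math

# Toy stand-in for an NMT decoder: given a prefix, return log-probabilities
# over a tiny vocabulary. The distribution depends only on the prefix length.
VOCAB = ["<eos>", "the", "cat", "sat"]

def step(prefix):
    scores = [(len(prefix) + i) % 4 + 1 for i in range(len(VOCAB))]
    total = sum(scores)
    return [math.log(s / total) for s in scores]

def greedy_decode(max_len=5):
    # One candidate per step: a single decoder call per output token.
    prefix, logp = [], 0.0
    for _ in range(max_len):
        logps = step(prefix)
        best = max(range(len(VOCAB)), key=lambda i: logps[i])
        logp += logps[best]
        prefix.append(VOCAB[best])
        if VOCAB[best] == "<eos>":
            break
    return prefix, logp

def beam_decode(beam_size=3, max_len=5):
    # beam_size candidates per step: roughly beam_size times the work of greedy.
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            if prefix and prefix[-1] == "<eos>":
                candidates.append((prefix, logp))  # finished hypothesis is frozen
                continue
            logps = step(prefix)
            for i, lp in enumerate(logps):
                candidates.append((prefix + [VOCAB[i]], logp + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0]

g_seq, g_logp = greedy_decode()
b_seq, b_logp = beam_decode()
assert b_logp >= g_logp  # here beam search finds a higher-scoring hypothesis
```

In this toy example beam search finds a higher-scoring hypothesis than greedy decoding, at roughly beam_size times the cost per step, which is exactly the gap the fast sequence interpolation method aims to close.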

Compact Network Architecture
Our network architecture is illustrated in Figure 1. It works well with fewer parameters, allowing us to match the performance of the baseline model at lower capacity. The convolutional encoder is similar to Gehring et al. (2016) and consists of two convolutional neural networks: CNN-a, which computes the attention scores, and CNN-c, which computes the conditional input fed to the decoder. The CNNs are constructed from blocks with residual connections (He et al., 2015). We use the relu6 1 non-linear activation function instead of the tanh used by Gehring et al. (2016) and achieve better training stability.
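The relu6 activation mentioned above is simply ReLU with its output clipped at 6; a minimal Python sketch:

```python
def relu6(x):
    """ReLU capped at 6: clipping the activation range keeps
    pre-activations bounded, which the paper credits with better
    training stability than tanh."""
    return min(max(x, 0.0), 6.0)

assert relu6(-3.0) == 0.0   # negative inputs are zeroed, as in plain ReLU
assert relu6(2.5) == 2.5    # the linear region passes values through
assert relu6(10.0) == 6.0   # large activations are clipped at 6
```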
To compress the embeddings, we introduce a cluster-based 2-component word representation: we cluster the words into C classes with word2vec 2 (Mikolov et al., 2013), where each class contains up to L words. The conventional embedding lookup table is then replaced by C + L unique vectors. For each word, we first look up one of the C class embeddings according to the cluster the word belongs to, and then look up one of the L location embeddings according to the word's position within its class. The two lookup results are concatenated to form the 2-component word representation. As a result, the number of embeddings is reduced from about C × L to C + L. Following Gehring et al. (2016), position embeddings are concatenated to convey the absolute position of each source word within a sentence.
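A minimal pure-Python sketch of this 2-component lookup (the cluster assignments, embedding values, and dimensions below are hypothetical toy placeholders, not values from the paper):

```python
import random

random.seed(0)

C, L, DIM = 4, 8, 3   # classes, max words per class, half-embedding size (toy values)

# C class embeddings + L location embeddings replace a C*L-entry lookup table.
class_emb = [[random.random() for _ in range(DIM)] for _ in range(C)]
loc_emb = [[random.random() for _ in range(DIM)] for _ in range(L)]

# Hypothetical clustering output: word -> (class id, location within class).
word_to_cl = {"the": (0, 0), "cat": (1, 0), "sat": (1, 1), "mat": (1, 2)}

def embed(word):
    """2-component representation: concatenate class and location embeddings."""
    c, l = word_to_cl[word]
    return class_emb[c] + loc_emb[l]

v = embed("cat")
assert len(v) == 2 * DIM
# "cat" and "sat" share a class embedding but differ in the location embedding.
assert embed("cat")[:DIM] == embed("sat")[:DIM]
assert embed("cat")[DIM:] != embed("sat")[DIM:]
```

With C and L on the order of the square root of the vocabulary size, the table shrinks from V vectors to roughly 2√V.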

Reference (Y): the official said the grenade explosion did not cause any casualties or damage .
Sample (S): the official said the grenade blast did not cause any death or injury nor any damage .
Interpolated Sample: the official said the grenade blast did not cause any casualties or damage .
Figure 2: Editing operation. We search for subsequences with the same boundary words between S and Y. The words within the boundary words can be different. Then we replace the subsequence in S with the subsequence in Y.

Weight Pruning
We then apply iterative pruning (Han et al., 2015) to obtain a sparse network, which allows us to use sparse matrix storage. To further reduce the storage size, most sparse matrix indices of our pruned model are stored as uint8 or uint16, depending on the matrix dimension.
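A minimal sketch of this storage scheme in Python (the magnitude-threshold pruning and the matrices are toy placeholders): nonzero values are kept alongside per-row column indices, and the integer width of the indices is chosen from the matrix dimension:

```python
from array import array

def index_typecode(dim):
    """Pick the narrowest unsigned index type the dimension allows:
    uint8 ('B') when column indices fit in one byte, else uint16 ('H')."""
    if dim <= 256:
        return "B"
    if dim <= 65536:
        return "H"
    return "L"  # fall back to uint32 for very wide matrices

def to_sparse(matrix, threshold):
    """Magnitude-prune a dense matrix and store it as (values, column
    indices, row offsets), a CSR-like layout."""
    cols = len(matrix[0])
    tc = index_typecode(cols)
    values, indices, row_offsets = [], array(tc), array("L", [0])
    for row in matrix:
        for j, w in enumerate(row):
            if abs(w) > threshold:   # weights below the threshold are pruned
                values.append(w)
                indices.append(j)
        row_offsets.append(len(values))
    return values, indices, row_offsets

dense = [[0.01, -0.9, 0.0],
         [0.5, 0.02, -0.03]]
values, indices, offsets = to_sparse(dense, threshold=0.1)
assert values == [-0.9, 0.5]
assert list(indices) == [1, 0]
assert indices.typecode == "B"   # 3 columns -> uint8 indices suffice
```

At 80% sparsity this index-plus-value layout is far smaller than the dense matrix, and halving the index width (uint16 to uint8) where the dimension permits shrinks the index arrays by another factor of two.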

Fast Sequence Interpolation
Let (X, Y) be a source-target sequence pair. Given X as input, let S be the corresponding greedy decoding result of a well-trained model. We then make two assumptions: (1) Let S̃ be a sequence close to S. If we train with (X, S̃), then S̃ will replace S as the result of greedy decoding with some probability P(S̃, S). (2) The closer S̃ is to S, the larger this probability: P(S̃_1, S) > P(S̃_2, S) whenever sim(S̃_1, S) > sim(S̃_2, S), where sim is a function measuring closeness, such as (negative) edit distance. If S̃ has a higher evaluation metric 3 (which we write as E) than S, then by (1) and (2) training with (X, S̃) is likely to turn the greedy decoding result into the better sequence S̃, and the more likely the closer S̃ is to S. We note that using S̃ as the label is therefore more attractive than using Y for improving greedy decoding: S and Y are often quite different (Kim and Rush, 2016), resulting in a relatively low P(Y, S). We bridge the gap between S and Y by interpolating inner sequences between them. Specifically, we edit S toward Y, which can be seen as interpolation. Editing is a heuristic operation, illustrated in Figure 2. Concretely, let S_s be a subsequence of S and let Y_s be a subsequence of Y sharing the same boundary words; we replace S_s with Y_s.

Algorithm 1: Editing algorithm of fast sequence interpolation.
Input: (X, Y, S, k): (X, Y) is a sequence pair from the training data; S is the result of greedy decoding on the source sequence X; k is the maximum number of tokens in a replaced subsequence of S or Y.
Output: S̃: the edited sample.
1: for s_i in S, y_j in Y do
2:   if s_i == y_j then
3:     if s_{i+p} == y_{j+q} for some 1 ≤ p, q ≤ k then
4:       replace the subsequence of S between s_i and s_{i+p} with the subsequence of Y between y_j and y_{j+q}

To obtain the target sequence Ỹ for training, we substitute S̃ for Y according to the following rule: Ỹ = S̃ if E(S̃) − E(S) > ε, and Ỹ = Y otherwise, where ε aims to ensure the quality of S̃. We define the substitution rate as the ratio of S̃ substituting Y over the training set. In summary, the following procedure is performed iteratively: (1) get a new batch of (X, Y); (2) run batched greedy decoding on X to obtain S; (3) edit S to obtain S̃; (4) get Ỹ according to the substitution rule; (5) train on the batch (X, Ỹ).
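A minimal pure-Python sketch of the editing step and the substitution rule (the boundary-word search is simplified to a single replacement, and `unigram_overlap` is a toy stand-in for BLEU, so this is an illustration of the idea rather than the paper's exact algorithm):

```python
def edit_toward(S, Y, k=3):
    """Edit the greedy sample S toward the reference Y: find a pair of
    boundary words shared by S and Y whose inner subsequences (at most
    k-1 tokens each here) differ, and splice in the version from Y."""
    edited = list(S)
    for i in range(len(edited)):
        for j in range(len(Y)):
            if edited[i] != Y[j]:
                continue  # left boundary words must match
            for p in range(1, k + 1):
                for q in range(1, k + 1):
                    if (i + p < len(edited) and j + q < len(Y)
                            and edited[i + p] == Y[j + q]        # right boundary matches
                            and edited[i + 1:i + p] != Y[j + 1:j + q]):
                        edited[i + 1:i + p] = Y[j + 1:j + q]     # replace the inner span
                        return edited
    return edited

def unigram_overlap(hyp, ref):
    # Toy stand-in for BLEU: fraction of hypothesis tokens present in the reference.
    return sum(t in ref for t in hyp) / len(hyp)

def choose_target(Y, S, eps=0.05, k=3):
    """Substitution rule: train on the edited sample only when it improves
    the (toy) metric over the greedy sample by more than eps; otherwise
    fall back to the gold target Y."""
    S_tilde = edit_toward(S, Y, k)
    gain = unigram_overlap(S_tilde, Y) - unigram_overlap(S, Y)
    return S_tilde if gain > eps else Y
```

For example, editing S = "a b X Y c" toward Y = "a b P c" splices "P" in between the shared boundary words "b" and "c", yielding "a b P c".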

Vocabulary Selection
We use word alignments 4 (Dyer et al., 2013) to build a candidate dictionary: for each source word, we build a list of candidate target words. When decoding, the top n candidates of each source word are merged to form a short-list for the softmax layer. We do not apply vocabulary selection during training.

Experiments

For the two translation tasks, the top 50K and 30K most frequent words are kept, respectively; the remaining words are replaced with UNK. We only use sentences of up to 50 symbols. We do not use any UNK-handling methods, for fair comparison. The evaluation metric is case-insensitive BLEU (Papineni et al., 2002) as calculated by the multi-bleu.perl script.
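Returning to the vocabulary-selection step above, here is a minimal pure-Python sketch; the candidate dictionary below is a hypothetical toy, where a real one would come from alignment counts:

```python
from itertools import islice

# Hypothetical candidate dictionary from word alignment:
# source word -> target candidates sorted by alignment probability.
candidates = {
    "Haus": ["house", "home", "building"],
    "grün": ["green", "verdant"],
    "das": ["the", "this", "that"],
}

def shortlist(source_sentence, n=2):
    """Merge the top-n aligned candidates of every source word into a
    short-list; the softmax is then computed over this small set
    instead of the full target vocabulary."""
    words = set()
    for src in source_sentence:
        words.update(islice(candidates.get(src, []), n))
    return sorted(words)

sl = shortlist(["das", "Haus"], n=2)
assert sl == ["home", "house", "the", "this"]
```

In practice the short-list also includes a fixed set of frequent target words, so the softmax shrinks from the full vocabulary to a few hundred entries per sentence.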
Hyper-parameters: For the baseline model, we use a 2-layer bidirectional GRU encoder (1 layer in each direction) and a 1-layer GRU decoder. In Baseline_L, the embedding size is 512 and the hidden size is 1024. In Baseline_S, the embedding size is 256 and the hidden size is 512. Our baseline models are similar to the architecture in DL4MT 6. For the convolutional encoder model, 512 hidden units are used for the 6-layer CNN-a and 256 hidden units for the 8-layer CNN-c. The embedding size is 256 and the hidden size of the decoder is 512. The kernel width in the CNNs is 3. The number of clusters for both the source and target vocabulary is 6. The editing rule for fast sequence interpolation is detailed in Algorithm 1. We use the top 50 candidates for each source word in vocabulary selection. The initial dropout rate is 0.3 and gradually decreases to 0 as the pruning rate increases. We use the AdaDelta optimizer and clip the norm of the gradient to be no more than 1. Our methods are implemented with TensorFlow 7 (Abadi et al., 2015). We run one-sentence decoding for all models under the same computing environment 8.

Figure 3: The performance and the substitution rate of FSI on the English-German (newstest2013) development set with varying threshold ε and subsequence length limit k.

Results and Discussions
Our experimental results are summarized in Table 1. The convolutional encoder model matches the performance of the GRU encoder model with about 2× fewer parameters. Combining it with embedding weight sharing yields a compact model with about 3.5× fewer parameters than the baseline model. After pruning 80% of the weights, we reduce the parameters by about 17× with a decrease of only 0.2 BLEU. The storage size of the final models is about 30MB, which fits easily into the memory of a mobile device. We find that the pruning rate of the embeddings is the highest even when weight sharing is used. Furthermore, the pruning rate of the CNN layers is higher than that of the GRU layers, which suggests that CNNs are more robust to pruning than RNNs. The pruning rate of each layer and the performance with increasing pruning rate are detailed in Figure 4. The compact architecture reduces the decoding time by only 20%; the reason is that the decoding time is dominated by the softmax layer. After applying fast sequence interpolation, we replace beam search with greedy decoding, which results in a speedup of over 5× with little loss in performance. We find that the details of the editing rules have little effect on FSI.

7 https://github.com/zxw866/CFNMT
8 We also test batched greedy decoding with a batch size of 128 and find it nearly ten times faster than one-sentence greedy decoding. We conjecture that our current one-sentence decoding implementation does not fully exploit hardware optimized for parallel computation; a well-optimized one-sentence decoding implementation could obtain a higher speedup.
This is because we only accept a S̃ whose BLEU improves by more than the threshold ε; otherwise we choose the gold target sequence. Figure 3 shows that an appropriate substitution rate is important for fast sequence interpolation. We conjecture that edited samples are still worse than gold target sequences, so a relatively high substitution rate may lead to instability in training. The speedup from vocabulary selection is only about 30%, which shows that the softmax layer no longer dominates the decoding time when greedy search is used.

Conclusion and Future Work
We investigate model compression and decoding speedup for NMT from the perspectives of network architecture, sparsification, computation, and search strategy, and verify the performance of their combination. A novel approach is proposed to improve the performance of greedy decoding, and embedding weight sharing is introduced into NMT. In the future, we plan to integrate weight quantization into our method.