Successfully Applying the Stabilized Lottery Ticket Hypothesis to the Transformer Architecture

Sparse models require less memory for storage and enable faster inference by reducing the necessary number of FLOPs. This is relevant both for time-critical and on-device computations using neural networks. The stabilized lottery ticket hypothesis states that networks can be pruned after no or only a few training iterations, using a mask computed from the unpruned converged model. On the transformer architecture and the WMT 2014 English-to-German and English-to-French tasks, we show that stabilized lottery ticket pruning performs similarly to magnitude pruning for sparsity levels of up to 85%, and propose a new combination of pruning techniques that outperforms all other techniques for even higher levels of sparsity. Furthermore, we confirm that a parameter's initial sign, and not its specific value, is the primary factor for successful training, and show that magnitude pruning cannot be used to find winning lottery tickets.


Introduction
Current neural networks are growing rapidly in depth and contain many fully connected layers. As every fully connected layer includes large matrices, models often contain millions of parameters. This is commonly seen as an over-parameterization (Dauphin and Bengio, 2013; Denil et al., 2013). Different techniques have been proposed to decide which weights can be pruned. In structured pruning techniques (Voita et al., 2019), whole neurons or even complete layers are removed from the network. Unstructured pruning only removes individual connections between neurons of succeeding layers, keeping the global network architecture intact. The first technique directly results in smaller model sizes and faster inference, while the second offers more flexibility in the selection of which parameters to prune. Although the reduction in necessary storage space can be realized using sparse matrix representations (Stanimirović and Tasić, 2009), most popular frameworks currently do not have sufficient support for sparse operations. However, there is active development for possible solutions (Liu et al., 2015; Han et al., 2016). This paper compares and improves several unstructured pruning techniques. The main contributions of this paper are to:
• verify that the stabilized lottery ticket hypothesis (Frankle et al., 2019) performs similarly to magnitude pruning (Narang et al., 2017) on the transformer architecture (Vaswani et al., 2017) with 60M parameters up to a sparsity of 85%, while magnitude pruning is superior for higher sparsity levels.
• demonstrate significant improvements for high sparsity levels over magnitude pruning by using it in combination with the lottery ticket hypothesis.
• confirm that the signs of the initial parameters are more important than the specific values to which they are reset, even for large networks like the transformer.
• show that magnitude pruning cannot be used to find winning lottery tickets, i.e., the final mask reached using magnitude pruning is no indicator for which initial weights are most important.

Related Work
Han et al. (2015) propose the idea of pruning weights with a low magnitude to remove connections that have little impact on the trained model. Narang et al. (2017) incorporate the pruning into the main training phase by slowly pruning parameters during the training, instead of performing one big pruning step at the end. Zhu and Gupta (2018) provide an implementation for magnitude pruning in networks designed using the tensor2tensor software (Vaswani et al., 2018). Frankle and Carbin (2018) propose the lottery ticket hypothesis, which states that dense networks contain sparse sub-networks that can be trained to perform as well as the original dense model. They find such sparse sub-networks in small architectures and simple image recognition tasks and show that these sub-networks might train faster and even outperform the original network. For larger models, Frankle et al. (2019) propose to search for the sparse sub-network not directly after the initialization phase, but after only a few training iterations. Using this adapted setup, they are able to successfully prune networks having up to 20M parameters. They also relax the requirement for lottery tickets so that they only have to beat randomly initialized models with the same sparsity level. Zhou et al. (2019) show that the signs of the weights in the initial model are more important than their specific values. Once the least important weights are pruned, they set all remaining parameters to fixed values, while keeping their original sign intact. They show that as long as the original sign remains the same, the sparse model can still train more successfully than one with a random sign assignment. Frankle et al. (2020) reach contradictory results for larger architectures, showing that random initialization with original signs hurts the performance.
compare different pruning techniques on challenging image recognition and machine translation tasks and show that magnitude pruning achieves the best sparsity-accuracy tradeoff while being easy to implement.
In concurrent work, Yu et al. (2020) test the stabilized lottery ticket on the transformer architecture and the WMT 2014 English→German task, as well as other architectures and fields. This paper extends the related works by demonstrating and comparing the applicability of different pruning techniques on a deep architecture for two translation tasks, as well as proposing a new combination of pruning techniques for improved performance.

Pruning Techniques
In this section, we give a brief formal definition of each pruning technique. For a more detailed description, refer to the respective original papers.
In the given formulas, a network is assumed to be specified by its parameters $\theta$. When training the network for $T$ iterations, $\theta_t$ for $t \in [0, T]$ represents the parameters at timestep $t$.
Magnitude Pruning (MP) relies on the magnitude of parameters to decide which weights can be pruned from the network. Different techniques for selecting the parameters to prune have been proposed (Collins and Kohli, 2014; Han et al., 2015; Guo et al., 2016; Zhu and Gupta, 2018). In this work, we rely on the implementation from Zhu and Gupta (2018), where the parameters of each layer are sorted by magnitude and an increasing percentage of the weights is pruned during training. It is important to highlight that MP is the only pruning technique not requiring multiple training runs.
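The gradual increase in sparsity can be illustrated with the cubic schedule proposed by Zhu and Gupta (2018). The following is a minimal sketch, not the implementation used in this work: the function name, the `begin_step` and `end_step` arguments, and the example values are assumptions made for illustration, and the update interval of 10k steps is taken from the experimental setup.

```python
def gradual_sparsity(step, s_0, s_T, begin_step, end_step, prune_every=10_000):
    """Target sparsity at `step` under a cubic schedule (after Zhu and Gupta, 2018).

    s_0 is the initial sparsity, s_T the final sparsity. All argument names
    are hypothetical and chosen for this sketch.
    """
    if step <= begin_step:
        return s_0
    if step >= end_step:
        return s_T
    # The sparsity target only changes at multiples of `prune_every` steps.
    last_update = begin_step + ((step - begin_step) // prune_every) * prune_every
    progress = (last_update - begin_step) / (end_step - begin_step)
    # Prune quickly at first, then more slowly as the target is approached.
    return s_T + (s_0 - s_T) * (1.0 - progress) ** 3


if __name__ == "__main__":
    for step in (0, 100_000, 250_000, 400_000, 500_000):
        print(step, gradual_sparsity(step, 0.0, 0.9, 0, 500_000))
```

Because of the cubic term, most of the pruning happens early, while many redundant weights remain, and the final steps remove only a few weights at a time.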
Lottery Ticket (LT) pruning assumes that for a given mask $m$, the initial network $\theta_0$ already contains a sparse sub-network $\theta_0 \odot m$ that can be trained to the same accuracy as $\theta_0$. To determine $m$, the parameters of each layer in the converged model $\theta_T$ are sorted by magnitude, and $m$ is chosen to mask the smallest ones such that the target sparsity $s_T$ is reached. We highlight that even though $m$ is determined using $\theta_T$, it is then applied to $\theta_0$ before the sparse network is trained. To reach high sparsity without a big loss in accuracy, Frankle and Carbin (2018) recommend pruning iteratively, by training and resetting multiple times.
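The mask computation for a single layer can be sketched as follows. This is an illustrative example using NumPy, not the code used in this work; the function names and the toy matrices are assumptions. The same mask could be applied to an intermediate checkpoint instead of the initial weights.

```python
import numpy as np


def magnitude_mask(theta_T, sparsity):
    """Boolean mask that removes the `sparsity` fraction of smallest |weights|."""
    n_prune = int(round(sparsity * theta_T.size))
    mask = np.ones(theta_T.size, dtype=bool)
    if n_prune > 0:
        # Flat indices of the n_prune smallest absolute values in this layer.
        smallest = np.argsort(np.abs(theta_T), axis=None, kind="stable")[:n_prune]
        mask[smallest] = False
    return mask.reshape(theta_T.shape)


def lottery_ticket(theta_0, theta_T, sparsity):
    """Compute the mask on the converged weights, apply it to the initial ones."""
    m = magnitude_mask(theta_T, sparsity)
    return theta_0 * m, m


if __name__ == "__main__":
    theta_0 = np.array([[1.0, -2.0], [3.0, -4.0]])  # toy initial weights
    theta_T = np.array([[0.1, -5.0], [0.2, 7.0]])   # toy converged weights
    ticket, m = lottery_ticket(theta_0, theta_T, sparsity=0.5)
    print(m)
    print(ticket)
```

Note that the ranking uses only the converged weights: an initial weight with a large magnitude is still removed if its trained counterpart ends up small.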
Stabilized Lottery Ticket (SLT) pruning is an adaptation of LT pruning for larger models. Frankle et al. (2019) propose to apply the computed mask $m$ not to the initial model $\theta_0$, but to an intermediate checkpoint $\theta_t$, where $0 < t \ll T$ is chosen to be early during the training. They recommend using $0.001T \le t \le 0.07T$ and refer to it as iterative magnitude pruning with rewinding. We highlight that Frankle et al. (2019) always choose $\theta_t$ from the first, dense model, while this work chooses $\theta_t$ from the last pruning iteration.
Constant Lottery Ticket (CLT) pruning assumes that the specific random initialization is not important. Instead, only the corresponding choice of signs affects successful training. To show this, Zhou et al. (2019) propose to compute $\theta_t \odot m$ as in SLT pruning, but then to train $f(\theta_t \odot m)$ as the sparse model. Here, $f$ sets all remaining parameters $p$ in each layer $l$ to $\mathrm{sign}(p) \cdot \alpha_l$, i.e., all parameters in each layer have the same absolute value, but their original sign. In all of our experiments, $\alpha_l$ is chosen to be $\alpha_l = \sqrt{\frac{6}{n^l_{in} + n^l_{out}}}$, where $n^l_{in}$ and $n^l_{out}$ are the respective incoming and outgoing connections to other layers.
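A minimal sketch of the function f for one weight matrix is given below. It assumes that the layer constant is the Glorot-uniform bound, i.e., the square root of 6 divided by the sum of incoming and outgoing connections, and that the matrix has shape (n_in, n_out); the function name and the toy values are hypothetical.

```python
import numpy as np


def constant_ticket(theta_t, mask):
    """Keep the sign of each surviving weight, replace its magnitude by alpha_l."""
    n_in, n_out = theta_t.shape  # assumed layout: (incoming, outgoing)
    # Assumption: alpha_l is the Glorot-uniform bound for this layer.
    alpha = np.sqrt(6.0 / (n_in + n_out))
    # Pruned positions (mask == 0) stay at zero.
    return np.sign(theta_t) * alpha * mask


if __name__ == "__main__":
    theta_t = np.array([[0.3, -0.1], [-0.7, 0.2]])
    mask = np.array([[1, 0], [1, 1]])
    print(constant_ticket(theta_t, mask))
```

After this transformation, the only information carried over from the checkpoint is the mask and one sign bit per remaining weight.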
SLT-MP is a new pruning technique, proposed in this work. It combines both SLT pruning and MP in the following way: First, SLT pruning is used to find a mask $m$ with intermediate sparsity $s_i$. This might be done iteratively. $\theta_t \odot m$ with sparsity $s_i$ is then used as the initial model for MP (i.e., $\theta_0 = \theta_t \odot m$). Here, in the formula for MP, $s_0 = s_i$. We argue that this combination is beneficial, because in the first phase, SLT pruning removes the most unneeded parameters, and in the second phase, MP can then slowly adapt the model to a higher sparsity.
MP-SLT is analogous to SLT-MP: First, MP is applied to compute a trained sparse network $\theta_T$ with sparsity $s_i$. This trained network directly provides the corresponding mask $m$. $\theta_t \odot m$ is then used for SLT pruning until the target sparsity is reached. This pruning technique tests whether MP can be used to find winning lottery tickets.

Experiments
We train the models on the WMT 2014 English→German and English→French datasets, consisting of about 4.5M and 36M sentence pairs, respectively. newstest2013 and newstest2014 are used as the development and test sets, respectively.
All experiments have been performed using the base transformer architecture as described in (Vaswani et al., 2017 Zhu and Gupta, 2018). For efficiency reasons, weights are only pruned every 10k iterations. Unless stated otherwise, we start with initial sparsity $s_0 = 0$. The final sparsity $s_T$ is individually given for each experiment.
We prune only the matrices, not biases. We report the approximate memory consumption of all trained models using the Compressed Sparse Column (CSC) format (Stanimirović and Tasić, 2009), which is the default for sparse data storage in the SciPy toolkit (Virtanen et al., 2020).
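To make the memory estimate concrete, the following sketch counts the bytes needed to store one pruned matrix in CSC format. CSC keeps one value and one row index per non-zero entry, plus one column pointer per column and a final end pointer. The 4-byte value and index sizes, the function names, and the example matrix shape are assumptions for illustration, not the settings used for the reported numbers.

```python
def csc_bytes(n_rows, n_cols, sparsity, value_bytes=4, index_bytes=4):
    """Approximate size of an (n_rows x n_cols) matrix stored in CSC format."""
    nnz = round((1.0 - sparsity) * n_rows * n_cols)  # surviving weights
    values = nnz * value_bytes                       # non-zero values
    row_indices = nnz * index_bytes                  # one row index per value
    col_pointers = (n_cols + 1) * index_bytes        # start offset of each column
    return values + row_indices + col_pointers


def dense_bytes(n_rows, n_cols, value_bytes=4):
    """Size of the same matrix stored densely."""
    return n_rows * n_cols * value_bytes


if __name__ == "__main__":
    for s in (0.5, 0.8, 0.9, 0.98):
        print(s, csc_bytes(512, 2048, s), dense_bytes(512, 2048))
```

Under these assumptions, every surviving weight costs about twice as much as in the dense layout, so the sparse representation only saves memory once the sparsity clearly exceeds 50%.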
Our initial experiments have shown that Adafactor leads to an improvement of 0.5 BLEU compared to Adam. Hence, we select it as our optimizer with a learning rate of $lr(t) = \frac{1}{\sqrt{\max(t, w)}}$ for $w$ = 10k warmup steps. We note that this differs from the implementation by , in which Adam has been used. We highlight that for all experiments that require a reset of parameter values (i.e., LT, SLT, CLT, SLT-MP, and MP-SLT), we reset $t$ to 0, to include the warmup phase in every training run.
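The learning rate schedule can be sketched in a few lines. The sketch assumes the inverse square root form, in which the rate is one over the square root of max(t, w); the function name is hypothetical.

```python
import math


def learning_rate(t, warmup=10_000):
    """Inverse square root schedule, held constant during the warmup steps."""
    return 1.0 / math.sqrt(max(t, warmup))


if __name__ == "__main__":
    for t in (0, 10_000, 40_000, 250_000):
        print(t, learning_rate(t))
```

Resetting t to 0 after every parameter reset means that each sparse training run starts again at the constant warmup rate instead of continuing with the small rate reached at the end of the previous run.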
A shared vocabulary of 33k tokens based on word-pieces (Wu et al., 2016) is used. The reported case-sensitive, tokenized BLEU scores are computed using SacreBLEU (Post, 2018); TER scores are computed using MultEval (Clark et al., 2011). All results are averaged over two separate training runs. For all experiments that require models to be reset to an early point during training, we select a checkpoint after 25k iterations.
All iterative pruning techniques except SLT-MP are pruned in increments of 10 percentage points up to 80%, then in 5-point increments, and finally to 98% sparsity. SLT-MP is directly trained using SLT pruning to 50% and further reduced by SLT to 60%, before switching to MP.
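The sequence of target sparsities implied by this description can be written down explicitly. The sketch below assumes that the 5-point increments end at 95% before the final step to 98%; the function name is hypothetical.

```python
def sparsity_levels():
    """Target sparsities (in percent) for the iterative pruning techniques."""
    levels = list(range(10, 81, 10))  # 10, 20, ..., 80
    levels += [85, 90, 95]            # 5-point increments (assumed to end at 95)
    levels.append(98)                 # final pruning step
    return levels


if __name__ == "__main__":
    levels = sparsity_levels()
    print(levels)
    print(len(levels), "pruning steps")
```

Each entry corresponds to one additional train-and-reset cycle, which is the source of the training overhead of the iterative techniques compared to MP.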

Experimental Results
In this section, we evaluate the experimental results for English→German and English→French translation given in Tables 1 and 2 to provide a comparison between the different pruning techniques described in Section 3.
MP Tables 1 and 2 clearly show a trade-off between sparsity and network performance. For every increase in sparsity, the performance degrades accordingly. We especially note that even for a sparsity of 50%, the baseline performance cannot be achieved. In contrast to all other techniques in this paper, MP does not require any reset of parameter values. Therefore, the training duration is not increased.
LT Frankle and Carbin (2018) test the LT hypothesis on the small ResNet-50 architecture (He et al., 2016) which is applied to ImageNet (Russakovsky et al., 2015).  apply LT pruning to the larger transformer architecture and the translation task WMT 2014 English→German, noting that it has been outperformed by MP. As seen in Table 1, simple LT pruning is outperformed by MP at all sparsity levels. Because LT pruning is an iterative process, training a network with sparsity 98% requires training and resetting the model 13 times, causing a large training overhead without any gain in performance. Therefore, simple LT pruning cannot be recommended for complex architectures.
SLT The authors of the SLT hypothesis (Frankle et al., 2019) state that after 0.1-7% of the training, the intermediate model can be pruned to a sparsity of 50-99% without serious impact on the accuracy. As listed in Tables 1 and 2, this allows the network to be pruned to 60% sparsity without a significant drop in BLEU, and is on par with MP up to 85% sparsity. As described in Section 4, for resetting the models, a checkpoint after $t$ = 25k iterations is used. For a total training duration of 500k iterations, this amounts to 5% of the training and is therefore within the 0.1-7% bracket given by Frankle et al. (2019). For individual experiments, we have also tried $t \in \{12.5\text{k}, 37.5\text{k}, 500\text{k}\}$ and obtained results similar to those listed in this paper. It should be noted that for the case $t$ = 500k, SLT pruning becomes a form of MP, as no reset happens anymore. We propose a more thorough hyperparameter search for the optimal $t$ value as future work.
Importantly, we note that the magnitude of the parameters in both the initial and the final models increases with every pruning step. This causes the model with 98% sparsity to have weights greater than 100, making it unsuitable for checkpoint averaging, as the weights become too sensitive to minor changes. Yu et al. (2020) report that they do successfully apply checkpoint averaging. This might be because they choose $\theta_t$ from the dense training run for resetting, while we choose $\theta_t$ from the most recent sparse training.
CLT The underlying idea of the LT hypothesis is that the untrained network already contains a sparse sub-network which can be trained individually. Zhou et al. (2019) show that only the signs of the remaining parameters are important, not their specific random value. While Zhou et al. (2019) perform their experiments on MNIST and CIFAR-10, we test this hypothesis on the WMT 2014 English→German translation task using a deep transformer architecture.
Surprisingly, CLT pruning outperforms SLT pruning at most sparsity levels (see Table 1). By shuffling or re-initializing the remaining parameters, Frankle and Carbin (2018) have already shown that LT pruning does not just learn a sparse topology, but that the actual parameter values are of importance. As the good performance of the CLT experiments indicates that changing the parameter values has little impact as long as the sign is kept the same, we verify that keeping the original signs is indeed necessary. To this end, we randomly assign signs to the parameters after pruning to 50% sparsity. After training, this model scores 24.6% BLEU and 67.5% TER, a clear performance degradation from the 26.7% BLEU and 65.2% TER given in Table 1. Notably, this differs from the results by Frankle et al. (2020), as their results indicate that the signs alone are not enough to guarantee good performance.
SLT-MP Across all sparsity levels, the combination of SLT pruning and MP outperforms all other pruning techniques. For high sparsity values, SLT-MP models are also superior to the SLT models by Yu et al. (2020), even though the latter start off from a better-performing baseline. We hypothesize that by first discarding 60% of all parameters using SLT pruning, MP is able to fine-tune the model more easily, because the least useful parameters are already removed.
We note that the high weight magnitude for sparse SLT models prevents successful MP training. Therefore, we have to reduce the number of SLT pruning steps by directly pruning to 50% in the first pruning iteration. However, as seen by comparing the scores for 50% and 60% sparsity on SLT and SLT-MP, this does not hurt the SLT performance.
For future work, we suggest trying different sparsity values $s_i$ for the switch between SLT and MP.
MP-SLT Switching from MP to SLT pruning causes the models to perform worse than for pure MP or SLT pruning. This indicates that MP cannot be used to find winning lottery tickets.

Conclusion
In conclusion, we have shown that the stabilized lottery ticket (SLT) hypothesis performs similarly to magnitude pruning (MP) on the complex transformer architecture up to a sparsity of about 85%. Especially for very high sparsities of 90% or more, MP has proven to perform reasonably well while being easy to implement and having no additional training overhead. We have also successfully verified that even for the transformer architecture, only the signs of the parameters are important when applying the SLT pruning technique. The specific initial parameter values do not significantly influence the training. By combining both SLT pruning and MP, we can improve the sparsity-accuracy tradeoff. In SLT-MP, SLT pruning first discards 60% of all parameters, so MP can focus on fine-tuning the model for maximum accuracy. Finally, we show that MP cannot be used to determine winning lottery tickets.
In future work, we suggest performing a hyperparameter search over possible values for $t$ in SLT pruning (i.e., the number of training steps that are not discarded during model reset), and over $s_i$ for the switch from SLT to MP in SLT-MP. We also recommend looking into why CLT pruning works in our setup, while Frankle et al. (2020) present opposing results.